Preface


The theory of linear models, especially methods of linear regression and the analysis
of variance, plays a central role in the statistical analysis of experimental data
and in modelling causal relationships. This theory has been developed and
treated in innumerable papers and monographs, and the book ‘Statistical
Inference in Linear Models’ by the same editors, H. Bunke and O. Bunke,
provides a comprehensive presentation. But, in many problems of statistical 
analysis and modelling these methods are not sufficient. Extended models as 
well as more general or different methods are needed. For instance, this is the 
case when only nonlinear regression functions give a sufficient description of 
causal relations or if the causes, the explanatory variables or regressors can
only be observed with errors. Sometimes also irregularities of the observation
errors, like the existence of outliers, call for the use of so-called ‘robust’ methods.
The treatment of such problems in general demands a considerably
higher numerical and computational effort. But increased computational
capabilities have offered the statistician new possibilities for dealing with such
more complicated problems. From this emanated a vigorous impetus for
further development of the statistical theory. A lot of theoretical and applied 
research on nonlinear regression analysis, on functional or structural relations 
and on nonparametric and robust estimations emerged in this connection. 
The authors hope to provide with this monograph a comprehensive and as 
unified as possible presentation of the state of the art in these fields. While 
single topics are treated in a series of books like those by M. G. Kendall and 
A. Stuart, E. Malinvaud, F. Schmidt, D. A. Ratkowski and Y. Bard, and while
there exist important survey papers for some fields, we are not aware of any 
comparable presentation of these fields. Essential parts of the book are cha- 
racterized by results of the authors’ own research, some of them unpublished 
until now. 

The book is addressed to statisticians in research, teaching, and applica- 
tions and to mathematicians who want to be informed on the fields mentioned 
above, i.e. on statistical inference, on parameters of nonlinear regression func- 
tions, on models with errors in the variables, or on robust methods for regression 
parameters. The reader should have a basic knowledge of probability theory 
and mathematical statistics, especially of regression analysis. The book is
designed in such a way that it can be used independently of the book ‘Statistical
Inference in Linear Models’. The different problems and results are presented
with mathematical rigour and in a systematic way, which is intended
to be as unified as possible. Thereby the different chapters are coordinated on 
the one hand, but independently readable on the other. In order not to exceed 
the already considerable size of the book, some results are discussed without 
giving proofs. Similarly, as in ‘Statistical Inference in Linear Models’ the 
theorems and auxiliary results of linear algebra, probability theory and sta- 
tistics, which are needed in the proofs, are included in an appendix in order 
to minimize the length of the proofs and to concentrate on the specific aspects 
of the considered fields. Some of the results from the appendix are new and
have been derived exclusively for the solution of the problems investigated. 
In references to results on inference for linear models, the corresponding sec- 
tions or results from ‘Statistical Inference in Linear Models’ are mentioned. 

In the following description of the content we cannot, in view of the size 
of the book, give a complete account. We want to indicate the essential orien- 
tation of the chapters, new results, and some very recent results of the basic 
literature which have been included. Many of the results have been obtained 
by the authors, a fact which will not always be explicitly stated. 

Chapter 1 is devoted to the estimation of parameters of nonlinear regression 
functions and to the testing of corresponding hypotheses. At first the question 
is discussed of why the approximation by linear models or a transformation of 
nonlinear models to such models only sometimes gives a satisfactory solution. 
The main objects of the investigations are the weighted least squares estimation
and the maximum likelihood estimation of regression parameters. Here
an important extension of the hitherto usual approach is taken as a basis, namely
avoiding the assumptions of adequacy of the regression function and of
homogeneous variances. Thus the basic asymptotic properties of the weighted
least squares estimator proved by R. I. Jennrich and E. Malinvaud, like consistency
and asymptotic normality, are generalized under a model without
assumption of a normal distribution, and an asymptotic analogue to the Gauss- 
Markov theorem is proved. Under the assumption of a normal distribution, 
stronger optimality properties (BAN property) are shown, which also charac- 
terize the normal distribution in the case of identical error distributions. 
Corresponding results are also derived for the maximum likelihood estimator 
under more general assumptions on the error distribution. A similar asymp- 
totic theory is also derived for the residual estimations of the variance. Asymp- 
totic tests for testing hypotheses on regression parameters based on the least 
squares estimator and on the likelihood ratio statistics are investigated and 
their asymptotic power under local alternatives is given. Confidence regions 
are surveyed. 

A separate extensive section deals with models with changes of state, for 
which different regression functions hold for certain subsets of observations. 
Models with abrupt and with continuous changes of state are considered. For
the estimation of the change points and of the regression parameters we in- 
vestigate suitable methods, especially least squares estimators, as well as their 
properties, such as consistency. We discuss in which way tests and methods of 
cluster analysis may be used for decisions on the presence of change points. 
The section gives a survey of the literature on models with changes of 
state. 

A further section is devoted to the asymptotic optimality of nonparametric 
estimators of regression functions. Optimality is considered in the local asymp- 
totic minimax sense connected with the work of Ibragimov and Khasminski. 
Their results together with those of Stone and others on lower bounds for the
asymptotic minimax risk and on functions attaining these bounds are reviewed.
Moreover the role of least squares splines and classical smoothing splines as
asymptotically optimal estimators and the corresponding L₂-risk convergence
order is discussed, including results of Agarwal, Studden, Cox and others. Exact
constants for the optimal convergence and corresponding estimators connected
with interpolating splines are derived based on results of Pinsker on observa- 
tions following a linear stochastic differential equation. 

In Chapter 2 we develop the theory of robust methods for inference on linear 
parameters in linear models with independent identically and continuously 
distributed errors. Following an introductory discussion of robustness, the 
treatment is mainly on asymptotic properties of L-estimators, which are based 
on linear combinations of order statistics, of R-estimators, which are derived 
from rank tests or rank-dependent criteria, and of M-estimators, which are 
computed by generalizations of the least-squares criterion. For the special 
case of the location model a finite sample minimax property proved by P.
Huber is shown for the M-estimator which is defined by Huber’s ψ-function.
Asymptotic normality is shown under certain regularity assumptions for M- 
estimators of linear parameters. According to a theorem by P. Huber, it turns 
out that Huber’s M-estimator has an asymptotic minimax property. This pro- 
perty also holds for the L- and R-estimators, which are asymptotically equi- 
valent to Huber’s M-estimator. For the asymptotic equivalence of M-, L-, and 
R-estimators we give equations between the corresponding weighting and 
score-generating functions. Numerical algorithms for the computation of M- 
estimators are extensively discussed following ideas of P. Huber and R. Dutter. 
As a basis for rank methods we first derive the known locally most powerful 
rank tests for the hypothesis of a vanishing regression part and for the hypothesis
of symmetry. For the corresponding linear rank statistics and signed
rank statistics, respectively, the asymptotic normality under the null hypothesis
is shown following J. Hajek.

Then we prove the uniform asymptotic linearity in the regression parameters
of the linear rank statistics and we give the theorem of C. van Eeden
for the signed rank statistics. By means of this property we can obtain the
asymptotic normality of R-estimators, which also allows the derivation of the
asymptotic efficiency under various distribution assumptions. The asymptotic
normality of a linearized version of rank estimations is also studied along the 
lines of C. van Eeden. Since the asymptotically efficient R-estimators depend 
on the unknown density, three asymptotically efficient adaptive methods are 
introduced, which were investigated by J. Hajek, R. Beran, and C. van Eeden. 
Asymptotic confidence intervals for one-dimensional regression coefficients 
are constructed from rank tests. We show that the ratio of the lengths of the
standard and the rank confidence intervals converges to the Pitman
asymptotic relative efficiency of the standard and rank tests. A sequential
confidence interval of given length, which has been given by F. J. Anscombe,
J. Geertsema and M. Ghosh and P. K. Sen and which is based on the Wilcoxon
test, is also discussed.

Chapter 3 gives an introduction to the topic of models with errors-in-varia- 
bles as well as a comprehensive presentation of the classical and recent results. 
After the discussion of simple examples from applications, we explain why the 
least squares estimators from regression models may be bad, if there are 
errors in the variables. 

General model formulations are exhaustively discussed. Within a survey on 
identifiability statements we give, among other things, the theorems by Reiersøl
and the result which states that the structural parameter is not consistently 
estimable in a model with nonrandom experimental design, if it is not identi- 
fiable in a corresponding model with random design. Maximum likelihood esti- 
mators are considered first of all for bivariate linear functional relations. Doing 
this, according to N. R. Cox and G. R. Dolby, models with nonrandom as well
as those with normally distributed random experimental design may be con- 
sidered simultaneously. Then we deal with multivariate models with nonrandom 
experimental design. The relation between maximum likelihood and least 
squares estimators is described. Under the assumption of independent mea- 
surement errors the maximum likelihood estimator may be obtained from an 
eigenvalue problem. Besides this known result we derive statements on equi- 
variance and uniqueness. For models with a covariance matrix, which is known 
up to an unknown factor, the coordinate-free approach allows a condensed 
unified presentation of some known results. The theorem by T. W. Anderson
on the maximum likelihood estimator under normally distributed independent 
errors with unknown covariance, which has been overlooked for a long time, 
leads to the solution of an eigenvalue problem. For nonlinear models we outline 
possibilities for simplification on the basis of special assumptions on the
error covariance and identifiability properties. For nonlinear models
with replications of a fixed experimental design we show, as in Chapter 1, 
the consistency, asymptotic normality, and optimality of the weighted least 
squares and the maximum likelihood estimator, respectively. 

The asymptotic normality is also shown for a modified Gauss-Newton ite- 
ration suggested by W. A. Fuller and K. M. Wolter. Explicit formulas and 
estimates for the asymptotic variances and covariances of the generalized 
least squares estimators are given in the bivariate case. As alternatives to
maximum likelihood estimators we consider, among others, instrumental 
variables estimators. The relations between known estimators for parameters
in functional relations and in simultaneous equations of econometrics are clarified
following ideas of T. W. Anderson and are connected with an approximate
power comparison of the estimators. In the same manner we compare the
modified maximum likelihood estimators and the two-stage least squares 
estimators investigated by W. A. Fuller. For linear models in which measure- 
ment errors and design points are generated by time series we treat consistency, 
asymptotic normality, and identifiability for an estimator introduced by P. M.
Robinson. 

A separate extensive section is devoted to a uniform asymptotic theory of 
linear models with nonrandom experimental design. Following the explanation 
of the parametrization and a collection of results on maximum likelihood 
estimators, we develop a general formulation of ‘canonical’ variables estima- 
tors, as a formal special case of which the maximum likelihood estimator 
arises. The consistency of such estimators is proved under certain assumptions,
which are additionally interpreted. Some special cases known from the litera- 
ture are discussed. Now, the asymptotic efficiency of the maximum likelihood 
estimator cannot be proved because of the infinite-dimensional parameter 
space. Hence assuming a normal distribution, we prove a limit normal distri- 
bution and, in connection with this, the efficiency in a certain heuristically 
motivated class of estimators. This class contains besides the maximum likeli- 
hood estimator the most important alternative estimators investigated in the 
literature. Moreover, from a method of improvement there results an easily 
computable efficient estimator. Many known results on limit distributions 
and comparisons follow from the general theorems.

Based on the results of T. W. Anderson we give a survey on tests and con- 
fidence regions for linear models. Finally we describe possibilities for the nu- 
merical calculation of weighted least squares estimators. For bivariate linear 
models with different but known covariances for each single design point we
describe a method of J. H. Williamson. A Newton-Raphson type method is 
discussed following M. O’Neill and L. G. Sinclair for bivariate polynomial 
models. The special structure of the Gauss-Newton methods is discussed for 
general errors-in-variables models and other models. 


Helga Bunke Olaf Bunke 
Nonlinear Regression


Chapter 1 


Parameter estimation and testing hypotheses 
in nonlinear models 


While a powerful theory of ‘small’ samples has been worked out for linear
regression models (compare Bunke and Bunke, 1986), in the nonlinear case
effective concepts of statistics fail because of the complicated structure of the
parameter space. Geometrical approaches as in the linear estimation theory,
or the theory of exponential families in the normal model are no longer
available. Heuristically motivated estimators such as the least squares estimator
need iteration procedures. They are not unbiased and their bias and
variance can only be determined approximately (cf. equations (1.1.9) and
(1.1.10)).

In order to avoid these difficulties the statistician will first of all try to
approximate the nonlinear model by a linear one or to reach a linear model by
an appropriate data transformation (Section 1.1.4). But, when approximating,
parameter interpretations and typically nonlinear effects, which may be precisely
what interests the scientist, often get lost. With the data transformation,
which is possible in exceptional cases only, the effect on the error structure
has to be taken into account.


One possibility for the theoretical treatment of the nonlinear regression model
consists in establishing an asymptotic theory (‘large’ sample size). If the consistency
of the parameter estimation has been shown (see Theorem 1.1.1),
then it is natural to suppose that, for large sample size, the model can be
approximately considered as linear, hence that the properties of this linear
model are reflected, e.g. in limit distributions of the parameter estimation
(see Theorem 1.1.2).


If we drop the supposition of the adequacy of the regression model in these
investigations, the asymptotic results provide important information about
the robustness of the methods with respect to model errors. At the same time
new starting points offer themselves for a theory of model choice. The analogy
to the linear model, which results in the structure of the limit distributions of
the estimation, leads to the following heuristic explanation for the established
statistical methods. The methods (least squares estimators, test statistics,
etc.) are constructed based on the nonlinear structure of the model, but their
assessment by (asymptotic) properties of distribution (optimality, quantiles,
etc.) is based on linear approximations. The nonlinear character of a model
will appear in the distribution properties only if higher order asymptotics is used.

From this point of view it is not very surprising that there are a lot of analogies
to the linear model when we assess the goodness of the nonlinear least
squares estimators by virtue of the covariance structure of their limit distribution
(Section 1.1.7). Without any supposition of a normal distribution
of the errors we can prove an asymptotic analogue of the Gauss-Markov
theorem (Theorem 1.1.3). Under normal distribution, stronger optimality
properties (BAN) can be proved, which also characterize the normal distri- 
bution in the case of an identical error distribution (Theorem 1.1.5). Corres- 
ponding investigations may also be carried out for the maximum likelihood 
estimator under general suppositions on the error distribution (see Theorem 
1.1.7). A similar asymptotic theory as for parameters of the regression function 
may be derived for the residual estimation of the variance (Section 1.1.8). 

For testing hypotheses, two procedures offer themselves: on the one
hand, we can, analogously to the linear model, construct tests on the basis of
the limit distribution of the least squares estimator (cf. equation (1.1.9)).
On the other hand, we can also verify the asymptotic distribution statements
(χ²) on likelihood ratio test statistics, which are known for the case of
identically distributed observations, for the nonlinear regression model
(Theorem 1.1.13). The evaluation of the power of the tests under local alternatives
is made possible by means of the concept of contiguous distributions.

An important model type, which formally belongs to the nonlinear regression 
models, but which plays a special role because of its specific structure, is 
represented by models with changes of state (Section 1.2). A certain regression 
setup only holds for a subset of the observations, otherwise another one does 
so. The point (perhaps date) at which this transition between the different 
states of the system happens is an additional nonlinear parameter of special 
interest. Problems from many fields of application lead to such models
(Section 1.2.1). 

Depending on the transition conditions for the regression functions in the 
transition points, we distinguish models with abrupt (Section 1.2.2) and con- 
tinuous (Section 1.2.3) changes of state. The difficulties resulting for the nu- 
merical calculation of the least squares estimator in such models suggest a 
skilful combination of test and estimation methods for the analysis of transition
points. This is the reason why the discussion of various tests for checking
the state stability takes up comparatively much space.

The consistency of the least squares estimator demands a modified consi- 
deration as the parameter space is generally not compact, but depends on the 
sample size, and the regression function does not continuously depend on the 
parameter (Section 1.2.4). 

In summary, for models with changes of state, methods
have to be developed in which numerical implementability is
considered from the very beginning.
1.1 Parameter estimation in nonlinear models


1.1.1 Introduction


Let a relation between $x$ and $y$ be described by a function $f: \mathcal{X} \to \mathbb{R}^1$, which
is called a regression function (cf. Bunke and Bunke, 1986, ch. 1). It is supposed
that, with a fixed value $x$ of the regressor, the value $y$ of the regressand is
random and has the expected value $f(x)$. The available information on $f$ is
expressed by giving a set $\mathcal{F} = \{g_\vartheta \mid \vartheta \in \Theta\}$ ($\Theta \subseteq \mathbb{R}^k$) of functions $g_\vartheta: \mathcal{X} \to \mathbb{R}^1$,
which is known to contain the ‘true regression function’ $f$, i.e. there is a $\vartheta_0 \in \Theta$
(‘true regression parameter’) with $f = g_{\vartheta_0}$. If $g_\vartheta$ is nonlinear in $\vartheta$, then we call
it a nonlinear regression function. If we have observations $y_t$ of the regressand
on values $x_t$ of the regressor, $t = 1, \ldots, n$, then we get the problems:


1. Estimation of $f$ or of the values of $f$ on a subset $\mathcal{X}_0 \subseteq \mathcal{X}$.

2. Estimation of $\vartheta_0$ and of derived parameters $\gamma(\vartheta_0)$.

3. Approximation of $f$ by a function from a given set $\mathcal{F}$ of functions that may
be of a simple structure or may have other advantages.

4. Confidence regions for $\vartheta_0$.

5. Testing of hypotheses on $f$ or $\vartheta_0$.


For the mathematical treatment of these problems we need appropriate suppositions
on the values $x_t$, $y_t$.

The values $x_t$ of the regressor, which are called design points, are given by a
vector $\xi = (x_1, \ldots, x_n)$, the experimental design. Moreover, the equations

$$y_t = f(x_t) + \varepsilon_t, \qquad t = 1, 2, \ldots, n \tag{1}$$

hold, where $\varepsilon_1, \ldots, \varepsilon_n$ are independent random variables with

$$\mathrm{E}\,\varepsilon_t = 0 \quad\text{and}\quad \mathrm{D}\,\varepsilon_t = \sigma_t^2. \tag{2}$$

The variances $\sigma_t^2$ are unknown, and we can make any suppositions of the form
$\sigma^2 := (\sigma_1^2, \ldots, \sigma_n^2) \in \mathcal{S}$. Later on we will need such suppositions for the whole
sequence of the $\sigma_t^2$, $t = 1, 2, \ldots$, in the section on asymptotic behaviour.

For the approximation of the expected value we establish a parametric
function class $\{g_\vartheta \mid \vartheta \in \Theta\}$, where $g_\vartheta: \mathcal{X} \to \mathbb{R}^1$ is a given function for each fixed
$\vartheta \in \Theta$. As usual let us suppose that

$$f \in \mathcal{F}; \tag{3}$$

then we say that the model is adequate ($f = g_{\vartheta_0}$). Later on we will sometimes drop this
assumption. In the asymptotic considerations we also permit the
functions $g_\vartheta$ to depend on $n$: $g_\vartheta^{(n)}$.

According to the above suppositions the $y_t$ are independent random quantities
the distribution of which is determined by the parameter $\varphi := (f, \kappa)$, where $\kappa$
denotes the distribution of $\varepsilon := (\varepsilon_1, \ldots, \varepsilon_n)'$. By giving a set $\mathcal{K}$ of distributions
with $\kappa \in \mathcal{K}$, the preliminary knowledge of $\kappa$ can be characterized.

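Following the notation of Bunke and Bunke (1986), for an arbitrary function
$g: \mathcal{X} \to \mathbb{R}^1$ we will denote the vector $(g(x_1), \ldots, g(x_n))'$ by $g^*$. With

$$\mathcal{F}^* = \{(g_\vartheta(x_1), \ldots, g_\vartheta(x_n))' \mid \vartheta \in \Theta\}, \qquad \mathcal{V} = \{\mathrm{Diag}[\sigma_1^2, \ldots, \sigma_n^2] \mid \sigma^2 \in \mathcal{S}\}$$

the set

$$[\mathcal{F}^* \times \mathcal{V}]_{\mathcal{K}} = \{W_\varphi \mid \varphi = (f, \kappa) \in \mathcal{F} \times \mathcal{K}\} \tag{4}$$

of possible distributions $W_\varphi$ of the random vector $y$ which is subject to (1),
(2), (3) is an adequate distribution model, which we call a nonlinear model, or,
in the special case of $\mathcal{F}^*$ being a linear subspace of $\mathbb{R}^n$, a linear model.
$y \in [\mathcal{F}^* \times \mathcal{V}]_{\mathcal{K}}$ is a way of writing that there exists a $W_\varphi \in [\mathcal{F}^* \times \mathcal{V}]_{\mathcal{K}}$ with
$y \sim W_\varphi$. In the literature also the short form

$$\mathrm{E}\,y \in \mathcal{F}^*, \qquad \mathrm{D}\,y \in \mathcal{V}$$

or also

$$y = \eta + \varepsilon, \qquad \eta \in \mathcal{F}^*, \quad \mathrm{E}\,\varepsilon = 0, \quad \mathrm{D}\,\varepsilon \in \mathcal{V}$$

is to be found if $\mathcal{K}$ is the set of all distributions with (2).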

Surveys on statistical methods in nonlinear regression problems and related
problems were given, e.g. in Bard (1974), Cox (1977) and in Bunke, Henschke,
Strüby, and Wisotzki (1977).


1.1.2 Examples 

For illustrative purposes we first of all give some typical examples of regression 
models. 

(a) Linear regression model 

If the regression function $g_\vartheta(x) = \vartheta' g(x)$ is linear in $\vartheta$, then (1) and (2) are called
a linear regression model (cf. Bunke and Bunke, 1986).

(b) Empirical growth curves 


Empirical growth curves occur in biological or chemical problems where a
quantity $u_t$ at time $t$ is observed. If we suppose that the speed of growth
$\mathrm{d}u_t/\mathrm{d}t$ is proportional to the quantity just achieved and to the difference
between a maximally achievable quantity $b$ and $u_t$, then the differential equation

$$\frac{\mathrm{d}u_t}{\mathrm{d}t} = a\,u_t\,(b - u_t)$$


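holds. Integration, taking the logarithm, reparametrization, and assumption
of an additive error structure lead to the model

$$y_t = \alpha - \ln\bigl(1 + e^{-(\lambda + \beta t)}\bigr) + \varepsilon_t = g_\vartheta(t) + \varepsilon_t, \qquad \vartheta = (\alpha, \lambda, \beta). \tag{5}$$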


This model was investigated by various authors, e.g. Nelder (1961, 1962). 
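
The step from the differential equation to a regression function that is additive on the logarithmic scale can be checked numerically. The following is only a sketch under stated assumptions: the numerical values of `a`, `b` and the starting value `u0` are hypothetical, the function names are ours, and the reparametrization used (alpha = ln b, lam = -ln c, beta = a*b with c = (b - u0)/u0) is one possible choice, not necessarily the one intended in the text.

```python
import numpy as np

def logistic_solution(t, a, b, u0):
    """Closed-form solution of du/dt = a*u*(b - u) with u(0) = u0."""
    c = (b - u0) / u0
    return b / (1.0 + c * np.exp(-a * b * t))

def log_growth_curve(t, alpha, lam, beta):
    """Log-scale regression function alpha - ln(1 + exp(-(lam + beta*t)))."""
    return alpha - np.log1p(np.exp(-(lam + beta * t)))

# Hypothetical values chosen only for illustration.
a, b, u0 = 0.3, 2.0, 0.1
t = np.linspace(0.0, 10.0, 201)
u = logistic_solution(t, a, b, u0)

# The closed form satisfies the differential equation (finite-difference check).
du_dt = np.gradient(u, t)
ode_gap = np.max(np.abs(du_dt[1:-1] - a * u[1:-1] * (b - u[1:-1])))

# Assumed reparametrization: alpha = ln b, lam = -ln c, beta = a*b.
alpha, lam, beta = np.log(b), -np.log((b - u0) / u0), a * b
reparam_gap = np.max(np.abs(np.log(u) - log_growth_curve(t, alpha, lam, beta)))
```

Both `ode_gap` and `reparam_gap` should be close to zero; the first is limited by the finite-difference error, the second only by rounding.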


(c) Exponential models 


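For many applications, mixtures of exponential functions are appropriate:

$$g_\vartheta(x) = \alpha_0 + \sum_{s=1}^{p} \alpha_s e^{\beta_s x_{(s)}}, \qquad \vartheta = (\alpha_0, \alpha_1, \ldots, \alpha_p, \beta_1, \ldots, \beta_p), \quad x = (x_{(1)}, \ldots, x_{(p)}), \quad x_{(i)} \in \mathbb{R}^1. \tag{6}$$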


For special cases specific methods of estimation have been developed (e.g. 
McGilchrist, 1968; Rasch, 1967; Agha, 1971; Saleh and Choudry, 1975). Models 
of the form (6) are also investigated under the name of ‘multicompartment 
systems’. 


(d) Cobb-Douglas models 


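In economic problems, e.g. in the investigation of production or demand functions,
regression functions of the following type occur:

$$g_\vartheta(x) = \alpha\, x_{(1)}^{\beta_1} \cdots x_{(p)}^{\beta_p}, \qquad \vartheta = (\alpha, \beta_1, \ldots, \beta_p), \quad x = (x_{(1)}, \ldots, x_{(p)}).$$

Such regression functions are called Cobb-Douglas functions. Sometimes a
multiplicative model is assumed in order to assure an additive error structure
again after taking the logarithm:

$$y_t = g_\vartheta(x_t)\,\varepsilon_t, \qquad \mathrm{E}\,\varepsilon_t = 1, \quad \mathrm{D}\,\varepsilon_t = \sigma_t^2.$$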


A detailed discussion is given, e.g. in Goldberger (1968) and Goldfeld and Quandt 
(1972). 
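
The effect of taking logarithms in a multiplicative Cobb-Douglas model can be illustrated by a small simulation. This is a minimal sketch under our own assumptions: the parameter values, the uniform design and the lognormal error law (scaled so that the errors have expectation 1) are hypothetical and only serve to show that ordinary linear least squares applies after the transformation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters and design, for illustration only.
alpha, beta = 2.0, np.array([0.6, 0.3])
n = 500
x = rng.uniform(1.0, 10.0, size=(n, 2))

# Multiplicative lognormal errors, scaled so that E(eps) = 1.
s = 0.1
eps = np.exp(rng.normal(-0.5 * s**2, s, size=n))
y = alpha * np.prod(x ** beta, axis=1) * eps

# After taking logarithms the model is linear in (ln alpha, beta_1, beta_2).
Z = np.column_stack([np.ones(n), np.log(x)])
coef, *_ = np.linalg.lstsq(Z, np.log(y), rcond=None)
beta_hat = coef[1:]

# The intercept estimates ln(alpha) + E(ln eps) = ln(alpha) - s**2 / 2 here,
# so a correction is needed that depends on the assumed error law.
alpha_hat = np.exp(coef[0] + 0.5 * s**2)
```

The correction of the intercept shows why the effect of a transformation on the error structure has to be taken into account: without knowledge of the error law, only the exponents are estimated without systematic shift.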
1.1.3 Least squares estimation


When estimating linear parameters $\gamma = C\vartheta$, a restriction to linear estimators
$\hat\gamma = Ly$ makes sense in linear regression models. This is not true for nonlinear
models since the domain of values of $\gamma$ is not a linear space in general. Suppose
we want to estimate $\gamma = g^*_\vartheta := (g_\vartheta(x_1), \ldots, g_\vartheta(x_n))'$; then it seems to be
reasonable to use estimators $\hat\gamma$ with values in $\mathcal{F}^* := \{g^*_\vartheta \mid \vartheta \in \Theta\}$. Such
estimators are, e.g., projections $g^*_{\hat\vartheta}$ of $y$ on $\mathcal{F}^*$, where we can suppose,
e.g., a norm of the form

$${}^{w}|y|_n = \Bigl[n^{-1} \sum_{t=1}^{n} w_t y_t^2\Bigr]^{1/2} \qquad (w_t > 0;\ t = 1, \ldots, n)$$

such that

$${}^{w}|y - g^*_{\hat\vartheta}|_n = \min_{\vartheta \in \Theta} {}^{w}|y - g^*_\vartheta|_n \tag{8}$$

holds.

By choosing the weights $w = \{w_t,\ t = 1, 2, \ldots\}$ differently, we obtain different
estimators. As we allow the heteroscedastic case, it suggests itself
to take the inverse variances as weights. If these are not known, we can
insert estimators instead and thus construct two-stage estimators. This is
also the reason why we allow the weights $w_t$ to be positive random quantities,
which may depend on the sample size $n$, too, in the asymptotic investigations
($w_t^{(n)}$).


Definition 1.1.1 Let a $\vartheta_0 \in \Theta$ with $f = g_{\vartheta_0}$ exist. An estimator $\hat\gamma$ of $\gamma(\vartheta_0)$ is
called a weighted least squares estimator (WLSE) if $\hat\gamma(y) = \gamma(\hat\vartheta(y))$ and if the
estimator $\hat\vartheta$ is a solution of (8). With $w_t = 1$ ($t = 1, \ldots, n$) ($w = w_1$), $\hat\gamma$ is called
an ordinary least squares estimator (OLSE). With $w_t = \sigma_t^{-2}$ ($t = 1, \ldots, n$)
($w = w_\sigma$), $\hat\gamma$ is called a generalized least squares estimator (GLSE).


If we speak of least squares estimators (LSE) or their generalizations in the
following, we always assume that there exists a solution $\hat\vartheta(y)$ of (8) and that
it can be represented as a measurable function of $y$.

If $g_\vartheta$ has for example a representation

$$g_\vartheta(x) = \alpha' h_\beta(x) \qquad (\vartheta = (\alpha, \beta) \in \Theta = \mathbb{R}^p \times \mathcal{B},\ x \in \mathcal{X})$$

with $h_\beta: \mathcal{X} \to \mathbb{R}^p$, where $\mathcal{B}$ is a compact subset of a Euclidean space, and if
$h_\beta(x)$ is continuous in $\beta$ for each fixed $x \in \mathcal{X}$, then the existence of a measurable
$\hat\vartheta(y)$ can easily be proved.

In case $\varepsilon_t \sim N(0, \sigma_t^2)$, $t = 1, \ldots, n$, is valid, the GLSE obviously coincides
with the maximum likelihood estimator (MLE). In general, a WLSE
cannot be explicitly computed, but iterative numerical methods have to be
applied. The WLSE is in general not unbiased and an exact computation of
the bias is not possible.
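
Since a WLSE has to be computed iteratively, a short sketch of one common iteration may be helpful. The code below uses a Gauss-Newton iteration with step halving; this is our choice for illustration and not a method prescribed by the text. The function names, the example regression function, the starting value and the variance pattern are all hypothetical.

```python
import numpy as np

def gauss_newton_wlse(g, jac, x, y, w, theta0, n_iter=50, tol=1e-10):
    """Minimize sum_t w_t * (y_t - g(x_t, theta))**2 by Gauss-Newton steps."""
    theta = np.asarray(theta0, dtype=float)
    sw = np.sqrt(w)

    def crit(th):
        r = y - g(x, th)
        return np.sum(w * r * r)

    for _ in range(n_iter):
        r = y - g(x, theta)
        F = jac(x, theta)                      # n x k matrix of partial derivatives
        # Weighted linear least squares problem for the Gauss-Newton step.
        step, *_ = np.linalg.lstsq(F * sw[:, None], r * sw, rcond=None)
        lam, q0 = 1.0, crit(theta)
        while crit(theta + lam * step) > q0 and lam > 1e-8:
            lam *= 0.5                         # halve the step until the criterion does not increase
        theta = theta + lam * step
        if np.linalg.norm(lam * step) < tol:
            break
    return theta

def g(x, th):
    return th[0] + th[1] * np.exp(th[2] * x)

def jac(x, th):
    e = np.exp(th[2] * x)
    return np.column_stack([np.ones_like(x), e, th[1] * x * e])

rng = np.random.default_rng(1)
theta_true = np.array([1.0, 2.0, -0.8])        # hypothetical values
x = np.linspace(0.0, 5.0, 200)
sigma = 0.01 * (1.0 + x)                        # assumed known, unequal standard deviations
y = g(x, theta_true) + sigma * rng.normal(size=x.size)

# Inverse variances as weights, i.e. the GLSE in the terminology above.
theta_hat = gauss_newton_wlse(g, jac, x, y, 1.0 / sigma**2, theta0=[0.8, 1.5, -0.5])
```

If the variances were unknown, the weights could be replaced by estimates from a first unweighted fit, which gives a two-stage estimator of the kind mentioned above.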
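Box (1971) derived a rough approximate formula for the bias of the OLSE under

$$\varepsilon_t \sim N(0, \sigma^2) \qquad (t = 1, \ldots, n) \tag{9}$$

by approximating $y - g^*_{\hat\vartheta}$ by a Taylor expansion of second order about $\vartheta_0$ and
by taking $\hat\vartheta - \vartheta_0 \approx A\varepsilon + (\varepsilon' B_1 \varepsilon, \ldots, \varepsilon' B_k \varepsilon)'$. This yields:

$$\mathrm{E}(\hat\vartheta - \vartheta_0) \approx -\frac{\sigma^2}{2} \Bigl[\sum_{j=1}^{n} F_j F_j'\Bigr]^{-1} \sum_{i=1}^{n} F_i \operatorname{tr}\Bigl\{\Bigl[\sum_{l=1}^{n} F_l F_l'\Bigr]^{-1} H_i\Bigr\} \tag{10}$$

with

$$F_t = F_t(\vartheta_0) = \frac{\partial g_\vartheta(x_t)}{\partial \vartheta}\Big|_{\vartheta = \vartheta_0}, \qquad H_t = \Bigl(\Bigl(\frac{\partial^2 g_\vartheta(x_t)}{\partial \vartheta_i\, \partial \vartheta_j}\Bigr)\Bigr)\Big|_{\vartheta = \vartheta_0}.$$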

There are various reasons to drop the supposition (3): $f \in \mathcal{F}$. On the one hand
it simply cannot be known whether the model with $\mathcal{F}$ is adequate, hence
whether (3) holds. Regression models are often applied although there is
little knowledge about the true structure of the dependences. The problem of
model choice is complicated in the nonlinear case, and it is not advisable
to assume adequacy from the very beginning. Furthermore, the actual problem
may consist in approximating the regression function in a given function class
$\mathcal{F}$. If we do not aim at the adequacy of the model, we may perhaps simplify
the optimization problem (8) by an appropriate choice of $\mathcal{F}$.

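In order to find the connection with the later asymptotic considerations we
still want to generalize (8). Let $\mathcal{F}_n = \{g_\vartheta^{(n)} \mid \vartheta \in \Theta\}$ be given function classes
($\mathcal{X} \to \mathbb{R}^1$), and $w = \{w_t^{(n)} \mid t = 1, \ldots, n\}$ sequences of positive random variables. Let

$$Q_n(\vartheta) := {}^{w}\bigl|y - [g_\vartheta^{(n)}]^*\bigr|_n$$

and

$$Q_n(\hat\vartheta) = \min_{\vartheta \in \Theta} Q_n(\vartheta). \tag{11}$$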

Definition 1.1.2 $\hat f_n := g^{(n)}_{\hat\vartheta}$ is called the weighted inadequate least squares
approximation (WILSA) of $f$ if $\hat\vartheta$ fulfils (11). The respective parameter estimator
$\hat\vartheta$ we call WILSE.


Apart from the heuristic sense of the optimization problems (8) and (11) we
associate with them the idea that, with a growing number of observations, the WILSE
$\hat\vartheta$ converges to $\vartheta_0$ and that the WILSA approximates the projection of
$f$ on $\mathcal{F}_n$ better and better in the sense of the seminorm $|f| := {}^{w}|f^*|_n$. These problems will be
investigated in Section 1.1.5.

The proof of the good approximation properties of the WILSA is of fundamental
importance for nonlinear regression models. The assumption of a
certain function class $\mathcal{F}$ almost always represents an approximation in applications,
i.e. the assumption of the adequacy of the model (‘there is a $\vartheta_0$ with
$g_{\vartheta_0} = f$’) is violated and at most approximately guaranteed (‘there is a $\vartheta_0$
with $g_{\vartheta_0} \approx f$’). Good properties of the WILSA correspond to the stability
properties of the least squares method under such inaccuracies of the model.

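If some of the components of $\vartheta$ enter linearly into $g_\vartheta$, i.e. if

$$g_\vartheta(x) = \alpha' h_\beta(x) \qquad (\vartheta = (\alpha, \beta) \in \Theta = \mathbb{R}^p \times \mathcal{B},\ x \in \mathcal{X}) \tag{12}$$

with $h_\beta: \mathcal{X} \to \mathbb{R}^p$, then the dimension of the nonlinear optimization problem
(8) may be reduced (see also Lawton and Sylvestre, 1971; Barham and Drane,
1972). With

$$P_\beta = H_\beta H_\beta^{+}, \qquad H_\beta^{+} = (H_\beta' D_w H_\beta)^{-1} H_\beta' D_w, \qquad H_\beta = ((h_{\beta i}(x_t)))$$

($h_{\beta i}$ denotes the $i$th component of $h_\beta$) and

$$D_w = \mathrm{Diag}[w_1, \ldots, w_n],$$

(8) transforms to

$${}^{w}|y - g^*_{\hat\vartheta}|_n = \min_{\beta \in \mathcal{B}} {}^{w}|y - P_\beta y|_n = {}^{w}|y - P_{\hat\beta}\, y|_n, \qquad \hat\vartheta = (\hat\alpha, \hat\beta), \quad \hat\alpha = H_{\hat\beta}^{+}\, y. \tag{13}$$

This fact is illustrated by the following simple example.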


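Example 1.1.1 Let

    $g_\vartheta(x) = \alpha_1 + \alpha_2 e^{\beta x} \qquad (\vartheta = (\alpha_1, \alpha_2, \beta)).$

With $w_t = 1$, (13) is given by

    $\min_{\beta} \sum_t \big(y_t - \hat\alpha_1(\beta) - \hat\alpha_2(\beta)\, e^{\beta x_t}\big)^2,$

    $\hat\alpha_1(\beta) = n^{-1} \Big[\sum_t y_t - \hat\alpha_2(\beta) \sum_t e^{\beta x_t}\Big],$

    $\hat\alpha_2(\beta) = \Big[\sum_t e^{2\beta x_t} - n^{-1} \Big(\sum_t e^{\beta x_t}\Big)^2\Big]^{-1} \Big[\sum_t y_t e^{\beta x_t} - n^{-1} \sum_t y_t \sum_t e^{\beta x_t}\Big].$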
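The reduction in Example 1.1.1 can be tried out numerically. The following is only a minimal sketch under stated assumptions: the data are noise-free toy values, the one-dimensional search over $\beta$ is a plain grid search (any scalar minimizer could be substituted), and the function names `profile_alphas`, `profiled_sse` and `fit_by_grid` are hypothetical helpers, not part of the original text.

```python
import math

def profile_alphas(beta, xs, ys):
    """Closed-form alpha_1(beta), alpha_2(beta) for g(x) = a1 + a2*exp(beta*x), w_t = 1."""
    n = len(xs)
    e = [math.exp(beta * x) for x in xs]
    s_e = sum(e)
    s_ee = sum(v * v for v in e)
    s_y = sum(ys)
    s_ye = sum(y * v for y, v in zip(ys, e))
    a2 = (s_ye - s_y * s_e / n) / (s_ee - s_e * s_e / n)
    a1 = (s_y - a2 * s_e) / n
    return a1, a2

def profiled_sse(beta, xs, ys):
    """Sum of squares after the linear parameters have been eliminated."""
    a1, a2 = profile_alphas(beta, xs, ys)
    return sum((y - a1 - a2 * math.exp(beta * x)) ** 2 for x, y in zip(xs, ys))

def fit_by_grid(xs, ys, betas):
    """Minimize the profiled sum of squares over a finite grid of beta values."""
    beta = min(betas, key=lambda b: profiled_sse(b, xs, ys))
    a1, a2 = profile_alphas(beta, xs, ys)
    return a1, a2, beta

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [1.0 + 2.0 * math.exp(0.5 * x) for x in xs]  # noise-free toy data
betas = [0.1 * j for j in range(1, 21)]           # beta = 0 is excluded: exp(0*x) is constant
a1, a2, beta = fit_by_grid(xs, ys, betas)
```

Only the scalar $\beta$ is searched; the two linear parameters follow in closed form, which is the point of the dimension reduction.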


1.1.4 Linearization 


In this section we assume the adequacy of the model, hence the validity of (3). 
As the nonlinearities cause profound difficulties for the treatment of the model, 
an approach that suggests itself is to try a linearization. There are various 
possibilities to linearize the model (1): 


(a) Embedding into a larger linear model $[\mathcal{L} \times V]$ with $\mathcal{F}^n \subset \mathcal{L}$.
(b) Approximation by a smaller linear model $[\mathcal{L} \times V]$ with $\mathcal{L} \subset \mathcal{F}^n$.

(c) Approximation by a linear model without the assumptions (a) or (b).
(d) Linearization by transformation.


(a) In most cases the embedding into a larger linear model has the disadvantage
that the best linear unbiased estimators of $\mu = Ey$ used in the linear model
$[\mathcal{L} \times V]$ are not efficient in the original model. Often $\mathcal{F}^n \subset \mathcal{L}$ is only fulfilled
for $\mathcal{L} = \mathbb{R}^n$ and the problem becomes uninteresting. Moreover, the estimators
may have values in $\mathcal{L} \setminus \mathcal{F}^n$, so that the interpretation of such estimated
values is difficult. But this procedure is sometimes a useful first step in
statistical analysis. For instance, this is the case if, with fixed x-values, there have
been repeated observations. If we have a normal model (normal distribution)
and if $P_{\mathcal{L}}$ is the projector onto $\mathcal{L}$ in the sense of the norm $\|\cdot\|$, then the
total 'information' from $y$ on $\mu$ and $\sigma$ is contained in a certain sense in
$z(y) := [P_{\mathcal{L}} y,\ y'(I_n - P_{\mathcal{L}}) y]$. $z(\cdot)$ are sufficient statistics (see Bunke and Bunke,
1986, Theorem 2.1.4). In this case $P_{\mathcal{L}} y$ is sufficient for $\mu$.


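Example 1.1.2 We consider a regression model with repeated observations,
i.e. the first $n_1$ design points $x_t$ equal $x^{(1)}$, the next $n_2$ equal $x^{(2)}$, etc., so that
there are $n_i$ observations at each $x^{(i)}$ ($i = 1, \dots, m$). With

    $U = \mathrm{Diag}(1_{n_1}, \dots, 1_{n_m}),$
    $g_\vartheta = (g_\vartheta(x^{(1)}), \dots, g_\vartheta(x^{(m)}))',$
    $\mathcal{F}^n = \{U g_\vartheta \mid \vartheta \in \Theta\}$ and $\mathcal{F}^n \subset \mathcal{L} = \mathcal{R}(U),$

we have for $\sigma_t = \sigma$:

    $P_{\mathcal{L}}\, y = U (\bar y^{(1)}, \dots, \bar y^{(m)})'$    (14)

(i.e. the vector of the mean values $\bar y^{(i)} = n_i^{-1} \sum_{j=1}^{n_i} y_{ij}$ of the observations
belonging to the design point $x^{(i)}$) and

    $y'(I_n - P_{\mathcal{L}})\, y = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (y_{ij} - \bar y^{(i)})^2.$    (15)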
4=1 e=1 


(b) and (c): The approximation by a smaller ($\mathcal{L} \subset \mathcal{F}^n$) linear model $[\mathcal{L} \times V]$
means that we restrict ourselves to estimation functions $\hat\mu: \mathbb{R}^n \to \mathcal{L}$ with values
in $\mathcal{L}$ when estimating $\mu$. This seems to be reasonable provided that $f$ lies 'in the
neighbourhood' of $\mathcal{L}$. The bias of such estimation functions

    $B(\hat\mu) = \|E\hat\mu(y) - f^n\|_A, \qquad A \in \mathfrak{M}^{\ge},$

becomes smallest iff

    $\|E\hat\mu(y) - P^{A}_{\mathcal{L}} f^n\|_A = 0$    (16)

holds, where $P^{A}_{\mathcal{L}} u$ denotes a certain element of $\mathcal{L}$ with
$\min_{z \in \mathcal{L}} \|u - z\|_A = \|u - P^{A}_{\mathcal{L}} u\|_A$
(projection in the sense of the seminorm $\|\cdot\|_A$) for any $u \in \mathbb{R}^n$. In the class of
estimation functions that satisfy the 'generalized unbiasedness' (16), or
conditions of the minimal bias for all admissible parameters, the estimation
function $\hat\mu$ can be determined with minimal risk $E\|\hat\mu(y) - f^n\|_A^2$ in the normal model.
According to Bunke and Bunke (1986, theorem 2.7.2), it is given by $\hat\mu(y)$

= P4Pey. 

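We are led to similar problems if we want to approximate the function $f$ by
a function $\hat f = \hat\alpha' h$ from a linear function space

    $F_0 = \{\alpha' h \mid \alpha \in \mathbb{R}^k\} \subset F, \qquad h: \mathcal{X} \to \mathbb{R}^k,$

where the goodness of the approximation is characterized by a seminorm $\|\cdot\|$
in $F$ (i.e. by $\|f - g\|$). In this case

    $\mathcal{L} = \mathcal{R}(H), \qquad H = ((h_i(x_t)))_{t=1,\dots,n;\ i=1,\dots,k}.$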


These problems were treated in detail in Bunke and Bunke (1986, ch. 2) and
Hoffmann (1977). There the demand for a minimal risk $E\|f - \hat\alpha(y)' h\|^2$ among
all $\hat f = \hat\alpha' h$ with

    $\|E\hat f - P_{F_0} f\| = 0$    (17)

($P_{F_0} f$: projection of $f$ onto $F_0$), as well as the minimization of the unfavourable
risk (respectively € € J X #H) (minimax approximation) were discussed. 

Nonlinear (in $\vartheta$) regression functions ($g_\vartheta: \mathcal{X} = \mathbb{R}^m \to \mathbb{R}^1$) are often
approximated by Taylor's series of $\nu$th order at a point $x^0$:

    $g_\vartheta(x) \approx g_\vartheta(x^0) + \sum_{j=1}^{\nu} \frac{1}{j!} \Big[(x - x^0)' \frac{d}{dx}\Big]^{j} g_\vartheta(x)\Big|_{x = x^0}$
    $\qquad\ \ = \sum_{j=0}^{\nu}\ \sum_{i_1 + \dots + i_m = j} c_\vartheta(i_1, \dots, i_m)\,(x_1 - x_1^0)^{i_1} \cdots (x_m - x_m^0)^{i_m}.$

Thus, we approximate $g_\vartheta$ by a polynomial in $x$. If we extend the domain of
values $\{c_\vartheta(i_1, \dots, i_m) \mid \vartheta \in \Theta\}$ in each case to $\mathbb{R}^1$, then we look for an
approximation in the linear space of such polynomials; conclusions from the new
parameter vector $c_\vartheta$ to the old parameter $\vartheta$ are not immediately possible after
such extensions of the model.

(d) By transformation many nonlinear regression functions, which often
arise in applications, may be brought into a functional form which is linear
in the parameters after a reparametrization (see for instance Draper and
Smith, 1966). If we consider e.g. the exponential model with $g_\vartheta(x) = \alpha e^{\beta x}$,
then we obtain the linear function $\ln g_\vartheta = a + \beta x$ with $a = \ln \alpha$. For the
Cobb-Douglas function

    $g_\vartheta(x) = \alpha\, x_1^{\beta_1} \cdots x_k^{\beta_k}$

we get $\ln g_\vartheta(x) = a + \sum_{i=1}^{k} \beta_i \ln x_i$. In certain physical problems the function
$\alpha/(1 + \beta x)$ occurs, which can be transformed into a linear form by $1/g_\vartheta$. Hence,
generalizing these examples we suppose that we have a model

    $y_t = f(x_t) + \varepsilon_t, \qquad t = 1, \dots, n$    (18)

with $E\varepsilon = 0$ and $D\varepsilon = \sigma^2 I$, $f \in \{g_\vartheta \mid \vartheta \in \Theta\}$, and that there exists a real
function $T$ with

    $T g_\vartheta(x) = \alpha'(\vartheta)\, h(x),$

where $h$ is a known $k$-dimensional vector function.
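As an illustration of linearization by transformation, the exponential model $\alpha e^{\beta x}$ can be fitted by ordinary least squares on the logarithms of the observations. This is a minimal sketch with noise-free toy data; the helper names `ols_line` and `fit_exponential_by_log` are hypothetical, and the approach presupposes positive observations.

```python
import math

def ols_line(xs, zs):
    """Ordinary least squares fit of z = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    mz = sum(zs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    slope = sxz / sxx
    return mz - slope * mx, slope

def fit_exponential_by_log(xs, ys):
    """Fit g(x) = alpha*exp(beta*x) via the transformation T = ln (requires y > 0)."""
    a, beta = ols_line(xs, [math.log(y) for y in ys])
    return math.exp(a), beta

xs = [0.0, 1.0, 2.0, 3.0]
ys = [3.0 * math.exp(-0.7 * x) for x in xs]  # noise-free toy data
alpha, beta = fit_exponential_by_log(xs, ys)
```

With noisy data the transformed errors no longer have the structure of the original errors, so the estimates obtained this way are in general only an approximation.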
It is a common procedure in practice to transform the model (18) formally
into

    $T y_t = T f(x_t) + \eta_t, \qquad t = 1, \dots, n$    (19)

and to suppose for $\eta$ similar structures as for $\varepsilon$, which is of course not generally
justified. In general the violation of the distribution assumptions for the
'error' $\eta$ has the consequence that the LSE, which are formally computed in the
'wrong' linear model (19), are neither consistent nor possess other good
statistical properties (cf. e.g. Goldberger, 1968). Nevertheless this method can be
justified as an approximation if only $\sigma$ is sufficiently small. This may be achieved
for instance by a suitable design planning (choice of the $x_t$) and a 'model
concentration' carried out before the transformation. Let us briefly explain this
procedure, which is described in Bunke (1976, 1977).

Suppose we have an experimental design with a spectrum $(x^{(1)}, \dots, x^{(m)})$
and denote the number of observations at the point $x^{(i)}$ by $n_i$. Then we can
describe (18) in the following way:

    $y = U g_{\vartheta_0} + \varepsilon.$    (20)

Now it is suitable not to change over from the model (18) to the model (19)
with the transformation $T$, but first to form mean values of the observations
at the fixed design points in order to achieve variances as small as possible.
Let $\bar a = (\bar a_1, \dots, \bar a_m)'$, $\bar a_i = 1_{n_i}' a_i / n_i$ for $a = (a_1', \dots, a_m')' \in \mathbb{R}^n$, $a_i \in \mathbb{R}^{n_i}$.
By taking the means, or by 'model concentration', we obtain from (20) that

    $\bar y = g_{\vartheta_0} + \bar\varepsilon.$    (21)

If we apply the transformation $T$ to (21), then we get

    $t := T\bar y = H\alpha(\vartheta_0) + T(g_{\vartheta_0} + \bar\varepsilon) - T g_{\vartheta_0},$    (22)

with

    $H = \big(h(x^{(1)}), \dots, h(x^{(m)})\big)'.$


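Let $T$ be twice continuously differentiable. Then it holds that

    $t = H\alpha(\vartheta_0) + Q_{\vartheta_0}\bar\varepsilon + W_{\vartheta_0}\bar\varepsilon^2,$

where $Q_{\vartheta_0}$ and $W_{\vartheta_0}$, respectively, are diagonal matrices with diagonal elements

    $\frac{dT(z)}{dz}\Big|_{z = g_{\vartheta_0}(x^{(i)})}, \qquad \frac{1}{2}\,\frac{d^2T(z)}{dz^2}\Big|_{z = g_{\vartheta_0}(x^{(i)}) + \delta_i};$

the $\delta_i$ are certain values between 0 and $\bar\varepsilon_i$, and $\bar\varepsilon^2 = (\bar\varepsilon_1^2, \dots, \bar\varepsilon_m^2)'$. If $H$ has
rank $k < m$, then the estimator

    $\hat\alpha = (H'H)^{-1} H' t$    (23)

is obviously consistent because of $\bar\varepsilon \to 0$ a.s. if $n_i \to \infty$ ($i = 1, \dots, m$).
The covariance structure of the approximate model

    $t = H\alpha(\vartheta_0) + Q_{\vartheta_0}\bar\varepsilon$    (24)

with

    $D(Q_{\vartheta_0}\bar\varepsilon) = \sigma^2\, Q_{\vartheta_0}\, \mathrm{Diag}[n_1^{-1}, \dots, n_m^{-1}]\, Q_{\vartheta_0}$

suggests applying the formal BLUE in (24) by using an 'initial estimator' $\hat\vartheta$:

    $\tilde\alpha = (H' \hat S^{-1} H)^{-1} H' \hat S^{-1} t, \qquad \hat S := Q_{\hat\vartheta}\, \mathrm{Diag}[n_1^{-1}, \dots, n_m^{-1}]\, Q_{\hat\vartheta}.$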
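The two-step procedure (model concentration, transformation, then a formally weighted estimator) can be sketched for the exponential model $\alpha e^{\beta x}$ with $T = \ln$, so that $h(x) = (1, x)'$ and the derivative of $T$ at $g$ is $1/g$. Everything below is an illustrative assumption rather than the authors' algorithm: the function names `weighted_line` and `concentrated_log_fit` are hypothetical, and the toy data are built so that the group means reproduce the regression function exactly.

```python
import math

def weighted_line(xs, ts, ws):
    """Weighted least squares fit of t = a + b*x with weights ws; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    mt = sum(w * t for w, t in zip(ws, ts)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxt = sum(w * (x - mx) * (t - mt) for w, x, t in zip(ws, xs, ts))
    slope = sxt / sxx
    return mt - slope * mx, slope

def concentrated_log_fit(design, groups):
    """Model concentration followed by T = ln and a formally weighted fit."""
    ns = [len(g) for g in groups]
    ybar = [sum(g) / len(g) for g in groups]        # group means, cf. (21)
    ts = [math.log(m) for m in ybar]                # t = T(ybar)
    a0, b0 = weighted_line(design, ts, [1.0] * len(design))  # unweighted fit, cf. (23)
    # Diagonal of Q is 1/g(x_i); the variance of (Q eps_bar)_i is proportional
    # to 1/(n_i * g_i^2), so the weights are n_i * g_i^2 at the initial estimate.
    g0 = [math.exp(a0 + b0 * x) for x in design]
    ws = [n * g * g for n, g in zip(ns, g0)]
    a1, b1 = weighted_line(design, ts, ws)
    return (math.exp(a0), b0), (math.exp(a1), b1)

design = [0.0, 1.0, 2.0]
g_true = [2.0 * math.exp(0.5 * x) for x in design]
groups = [[g_true[0] - 0.1, g_true[0] + 0.1],
          [g_true[1] - 0.2, g_true[1], g_true[1] + 0.2],
          [g_true[2], g_true[2]]]
initial, two_stage = concentrated_log_fit(design, groups)
```

Because the group means are exact here, both estimators recover $\alpha = 2$ and $\beta = 0.5$; with genuinely noisy data they would differ, and the weighted version accounts for the unequal variances of the transformed means.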


1.1.5 Consistency and asymptotic distribution 
of the least squares estimation 


1.1.5.1 Introduction 


In this section we will give conditions for the consistency and the asymptotic
normality of the LSE. The first papers devoted to this problem were those by
Jennrich (1969) and Malinvaud (1970). The difference between the two approaches
is that in the first work strong consistency is shown under the assumption
that the parameter space is compact, whereas in the second work weak
consistency is obtained without any assumption of compactness. In this chapter
we follow a representation by Bunke and Schmidt (1980), where the results by
Jennrich are generalized. The assumption of compactness is weakened in the
sense that parameters that enter the regression function linearly may vary in
the whole Euclidean space. Furthermore, nonidentically distributed errors,
especially heteroscedasticity, and the inadequacy of the model are allowed. By
dropping the adequacy assumption we also include the combination of regression
and approximation problems, which was introduced in Bunke and Bunke (1986)
under the name 'approgression'.

The class of admissible estimation functions is extended by introducing the 
weighted sums of squares. In this section we furthermore present the results 
by Zwanzig (1980), which contain the consistency and asymptotic normality 
of the parameter estimation in the inadequate model. 

Multivariate generalizations, which we do not consider here, are given for
instance by Malinvaud (1970), Barnett (1976), and Fedorov (1977). In Barnett
(1976) conditions are formulated under which the multivariate LSE weighted 
with an estimated covariance matrix is equivalent with a consistent local 
maximum of the likelihood function. We do not take into account the case 
that the sequence of errors is generated by a stationary time series, which 
suggests a different treatment (frequency domain). See, for example, Hannan 
(1971). 


1.1.5.2 The model and assumptions 


Again, we assume the model

    $y_t = f(x_t) + \varepsilon_t, \qquad t = 1, \dots, n$    (25)

with an unknown regression function

    $f: \mathcal{X} \to \mathbb{R}^1, \qquad f \in F.$

We discuss the asymptotic behaviour of the approgression estimation: in our
formulation from Section 1.1.3 this means the behaviour of the weighted
inadequate least squares approximation (WILSA) $g^{(n)}_{\hat\vartheta_n}$ and the weighted
inadequate least squares estimator $\hat\vartheta_n$ (WILSE), as well as of the corresponding
estimation under the supposition of adequacy '$f = g_{\vartheta_0}$'. Let us first of all
introduce some notation:


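For $l, k: \mathcal{X} \to \mathbb{R}^1$ we define

    $(l, k)_n := {}^{w}(l, k)_n = n^{-1} \sum_{t=1}^{n} w^{(n)}_t\, l(x_t)\, k(x_t)$

and

    ${}^{w}|l|_n^2 = {}^{w}(l, l)_n,$

where

    $w = \{w^{(n)}_t,\ t = 1, \dots, n;\ n = 1, 2, \dots\}$

is a double sequence of random variables about which we assume that

    $s_n := \max_{1 \le t \le n} |w^{(n)}_t - u_t| \to 0$ a.s., where $u =: u(w) = \{u_t,\ t = 1, 2, \dots\}$

is a sequence of positive numbers with $0 < \kappa \le u_t \le c < \infty$, $t = 1, 2, \dots$.
Correspondingly we use the notation

    ${}^{w}|y - l|_n^2 = n^{-1} \sum_{t=1}^{n} w^{(n)}_t (y_t - l(x_t))^2, \qquad y \in \mathbb{R}^n,$

and

    ${}^{w}(l, k)_n = (({}^{w}(l_i, k_j)_n))_{i=1,\dots,p;\ j=1,\dots,q}$

for $l: \mathcal{X} \to \mathbb{R}^p$, $k: \mathcal{X} \to \mathbb{R}^q$.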
“UE B= (Ct Hap eee: 


LOE TE sed Re > Wy so 
The WILSA $g^{(n)}_{\hat\vartheta_n}$ introduced in Definition 1.1.2 is the solution of the
minimization problem $\min_{\vartheta \in \Theta} Q_n(\vartheta)$ with

    $Q_n(\vartheta) := Q_n^{w}(\vartheta) = {}^{w}|y - g^{(n)}_\vartheta|_n^2.$

(Where no confusion is possible or where the dependence need not be specially
emphasized, we leave out the labelling with the sequence of weights $w$.
Conversely, in this as well as in the following chapters, we point out the
dependence of the estimator $\hat\vartheta$ on the sample size by writing $\hat\vartheta_n$.)
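The weighted scalar product and the criterion $Q_n$ introduced above translate directly into code. The following sketch uses fixed numerical weights and an exponential toy model; the function names `w_inner`, `q_n` and `g_exp` are hypothetical and only serve the illustration.

```python
import math

def w_inner(l, k, xs, ws):
    """Weighted scalar product: n^{-1} * sum_t w_t * l(x_t) * k(x_t)."""
    return sum(w * l(x) * k(x) for x, w in zip(xs, ws)) / len(xs)

def q_n(theta, g, xs, ys, ws):
    """Weighted sum of squares: n^{-1} * sum_t w_t * (y_t - g(x_t, theta))^2."""
    return sum(w * (y - g(x, theta)) ** 2 for x, y, w in zip(xs, ys, ws)) / len(xs)

def g_exp(x, theta):
    """Toy regression function alpha * exp(beta * x) with theta = (alpha, beta)."""
    alpha, beta = theta
    return alpha * math.exp(beta * x)

xs = [1.0, 2.0, 3.0]
ws = [1.0, 2.0, 3.0]
ys = [g_exp(x, (2.0, 0.3)) for x in xs]           # noise-free toy observations
ip = w_inner(lambda x: x, lambda x: 1.0, xs, ws)  # (1*1 + 2*2 + 3*3) / 3
q_true = q_n((2.0, 0.3), g_exp, xs, ys, ws)       # zero at the generating parameter
q_other = q_n((2.0, 0.4), g_exp, xs, ys, ws)      # positive elsewhere
```

A WILSE is any minimizer of `q_n` over the parameter set; the sketch only evaluates the criterion and leaves the choice of optimizer open.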

First we formulate some assumptions: 


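A_1  $\varepsilon_t$, $t = 1, 2, \dots$ are independent random variables with $E\varepsilon_t = 0$, $E\varepsilon_t^2 =: \sigma_t^2$,
and we have
(a) $\varepsilon_t$, $t = 1, 2, \dots$ are identically distributed with $\sigma_t = \sigma$ or
(b) the $\varepsilon_t$ satisfy a modified Lindeberg condition:

    $\sigma_t^2 \ge \gamma > 0$ for all $t$ and $\sup_t \int_{|x| \ge c} x^2\, dF_t(x) \to 0$ as $c \to \infty$,

where $F_t$ denotes the distribution function of $\varepsilon_t$. Furthermore, let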


Oh SO CDs Sal PAs goa 


    $\sum_{t=1}^{\infty} t^{-2} E\varepsilon_t^4 < \infty$ and $\tau_n := n^{-1} \sum_{t=1}^{n} u_t \sigma_t^2 \to \tau_u > 0$ as $n \to \infty$

hold.


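A_2  For the given sequence of functions $g^{(n)}: \mathcal{X} \times \Theta \to \mathbb{R}^1$ we use the notation
$g^{(n)}_\vartheta := g^{(n)}(\cdot, \vartheta)$. We suppose the functions to have the following
representation:

    $g^{(n)}_\vartheta(x) = \alpha' h^{(n)}(x, \beta) \qquad (h^{(n)}: \mathcal{X} \times \mathcal{B} \to \mathbb{R}^p)$

with

    $\vartheta = (\alpha, \beta) \in \Theta = \mathbb{R}^p \times \mathcal{B}$ ($\mathcal{B}$ a compact subset of $\mathbb{R}^m$).

There is a function $h: \mathcal{X} \times \mathcal{B} \to \mathbb{R}^p$, which is continuous in $\beta$ for any
fixed $x \in \mathcal{X}$, so that with $h^{(n)}_\beta := h^{(n)}(\cdot, \beta)$ and $h_\beta := h(\cdot, \beta)$

    $\sup_{\beta \in \mathcal{B}} {}^{u}|h^{(n)}_\beta - h_\beta|_n \to 0$

holds.

A_3  (a) Let $\mathcal{H}$ denote the set of functions

    $\{f,\ h_{\beta(i)} \mid i = 1, \dots, p;\ \beta \in \mathcal{B}\}$ (where $h_{\beta(i)}$ is the $i$th component of $h_\beta$),

then there exists a real number ${}^{u}(l, k)$ for all $l, k \in \mathcal{H}$, such that

    $\sup_{l, k \in \mathcal{H}} |{}^{u}(l, k)_n - {}^{u}(l, k)| \to 0$

holds.
(b) For all $\beta \in \mathcal{B}$ let ${}^{u}(h_\beta, h_\beta)$ be a nonsingular matrix.

A_4  Let a unique solution $\vartheta^f$ of

    $\min_{\vartheta \in \Theta} {}^{u}|f - g_\vartheta|^2 = {}^{u}|f - g_{\vartheta^f}|^2$

exist.

A_5  There is a $\vartheta_0 \in \Theta$ for which $f = g_{\vartheta_0}$. Here we have

    $g_\vartheta = \alpha' h_\beta, \qquad \vartheta = (\alpha, \beta).$

A_6  For all $\vartheta, \bar\vartheta \in \Theta$ we have ${}^{u}|g_\vartheta - g_{\bar\vartheta}| = 0$ iff $\vartheta = \bar\vartheta$.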


The assumption A_4 includes the asymptotic identifiability of the parameter
$\vartheta^f$, to which the best approximation $g_{\vartheta^f}$ of the function $f$ in the sense of the norm
${}^{u}|\cdot|$ corresponds. With A_5 the adequacy of the regression model is ensured and
A_6 provides the asymptotic identifiability of the 'true' parameter $\vartheta_0$.


1.1.5.3 Consistency 


The following theorem supplies the consistency of the WILSA, WILSE and
WLSE.


Theorem 1.1.1
1. With A_1 (a) or (b), A_2 and A_3, it holds almost surely that

    $\lim_{n \to \infty} {}^{w}|f - g^{(n)}_{\hat\vartheta_n}|_n = \lim_{n \to \infty} {}^{u}|f - g_{\hat\vartheta_n}| = \Delta_f,$    (26)

where $\Delta_f := \min_{\vartheta \in \Theta} {}^{u}|f - g_\vartheta|$ (consistency of the WILSA).

2. With A_1 (a) or (b), A_2, A_3, and A_4, we have

    $\hat\vartheta_n \to \vartheta^f$ a.s. (consistency of the WILSE).

3. With A_1 (a) or (b), A_2, A_3, and A_5, it holds that

    $\Delta_f = 0$ and $Q_n(\hat\vartheta_n) \to \tau_u$ a.s. (consistency of the variance estimator).

4. With A_1 (a) or (b), A_2, A_3, A_5, and A_6, we have

    $\hat\vartheta_n \to \vartheta_0$ a.s. (consistency of the WLSE).

We prepare the proof of the theorem by means of several lemmas.


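Lemma 1.1.1 A_3 yields

    $\sup_{l, k \in \mathcal{H}} |{}^{w}(l, k)_n - {}^{u}(l, k)| \to 0$ a.s.

Proof. It holds that

    $\sup_{l, k \in \mathcal{H}} |{}^{w}(l, k)_n - {}^{u}(l, k)|$
    $\qquad \le \sup_{l, k \in \mathcal{H}} |{}^{w}(l, k)_n - {}^{u}(l, k)_n| + \sup_{l, k \in \mathcal{H}} |{}^{u}(l, k)_n - {}^{u}(l, k)| \to 0$ a.s.

because of

    $|{}^{w}(l, k)_n - {}^{u}(l, k)_n| \le \frac{s_n}{\kappa}\, {}^{u}|l|_n\, {}^{u}|k|_n$    (27)

and

    $\sup_{l, k \in \mathcal{H}} |{}^{w}(l, k)_n - {}^{u}(l, k)_n| \le \frac{s_n}{\kappa} \sup_{l, k \in \mathcal{H}} {}^{u}|l|_n\, {}^{u}|k|_n \to 0$ a.s.,

since $\sup_{l \in \mathcal{H}} {}^{u}|l|_n \to \sup_{l \in \mathcal{H}} {}^{u}|l| < \infty$ follows from A_3.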


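Lemma 1.1.2 Let A_1, A_2, A_3 be fulfilled. Then

    $\sup_{l \in \mathcal{H}} {}^{w}(l, \varepsilon)_n \to 0$ a.s.

Proof. First we show

    $\sup_{l \in \mathcal{H}} {}^{u}(l, \varepsilon)_n \to 0$ a.s.    (28)

Because of A_1 the conditions of the strong law of large numbers are satisfied
for ${}^{u}(l, \varepsilon)_n$. Hence ${}^{u}(l, \varepsilon)_n \to 0$ a.s.
For the proof of the uniform convergence (28) it is obviously sufficient, by the
definition of $\mathcal{H}$, to show for fixed $i$ that

    $\sup_{\beta \in \mathcal{B}} {}^{u}(h_{\beta(i)}, \varepsilon)_n \to 0$ a.s.

Because of A_3 and the continuity of $h_{\beta(i)}$ there exists, for each $\eta > 0$ and $\bar\beta \in \mathcal{B}$,
a neighbourhood $U_{\bar\beta}$ of $\bar\beta$ and an $n(\eta, \bar\beta)$ with ${}^{u}|h_{\beta(i)} - h_{\bar\beta(i)}|_n < \eta/2$ for all
$\beta \in U_{\bar\beta}$ and $n > n(\eta, \bar\beta)$. Because of the compactness of $\mathcal{B}$ there is a finite cover
$\mathcal{U} = \{U_{\beta_j}\}$ of $\mathcal{B}$ and an $n(\eta)$ such that

    $\sup_{\beta \in U_{\beta_j}} {}^{u}|h_{\beta(i)} - h_{\beta_j(i)}|_n < \eta$

holds for $n > n(\eta)$ and $U_{\beta_j} \in \mathcal{U}$. Because of A_1 we have ${}^{u}|\varepsilon|_n^2 \to \tau_u$ a.s. Now
we consider a realization with

    ${}^{u}|\varepsilon|_n^2 \to \tau_u$ and ${}^{u}(h_{\beta_j(i)}, \varepsilon)_n \to 0.$

Since

    $|{}^{u}(h_{\beta(i)}, \varepsilon)_n| \le {}^{u}|h_{\beta(i)} - h_{\beta_j(i)}|_n\, {}^{u}|\varepsilon|_n + |{}^{u}(h_{\beta_j(i)}, \varepsilon)_n|, \qquad \beta \in U_{\beta_j},$

there is an $n(\eta)$ with $|{}^{u}(h_{\beta(i)}, \varepsilon)_n| < \eta$ for all $n > n(\eta)$ and $\beta \in \mathcal{B}$. Thus (28) is
shown.
We use the representation

    ${}^{w}(l, \varepsilon)_n = {}^{u}(l, \varepsilon)_n - \xi_n, \qquad \xi_n := n^{-1} \sum_{t=1}^{n} (u_t - w^{(n)}_t)\, l(x_t)\, \varepsilon_t.$    (29)

With this we have

    $\xi_n^2 = \Big(n^{-1} \sum_{t=1}^{n} (u_t - w^{(n)}_t)\, l(x_t)\, \varepsilon_t\Big)^2 \le \frac{s_n^2}{\kappa}\, {}^{u}|l|_n^2 \cdot \frac{1}{n} \sum_{t=1}^{n} \varepsilon_t^2.$    (30)

In the case of A_1 (a) it follows that $\frac{1}{n} \sum_{t=1}^{n} \varepsilon_t^2 \to \sigma^2$ a.s. In the case of A_1 (b) it
follows that $\frac{1}{n} \sum_{t=1}^{n} (\varepsilon_t^2 - \sigma_t^2) \to 0$ a.s. according to Kolmogorov's strong law of
large numbers, because

    $\sum_{t=1}^{\infty} t^{-2} E(\varepsilon_t^2 - \sigma_t^2)^2 \le \sum_{t=1}^{\infty} t^{-2} E\varepsilon_t^4 < \infty.$

Because of $s_n^2 \to 0$ a.s. and $\sup_{l \in \mathcal{H}} {}^{u}|l|_n < \infty$ we obtain $\sup_{l \in \mathcal{H}} \xi_n^2 \to 0$ a.s. from (30).
Together with (28) and (29) this yields the assertion.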


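Because of A_3, ${}^{u}|f - \alpha' h_\beta|$ is defined for all $\alpha \in \mathbb{R}^p$ and $\beta \in \mathcal{B}$. Consequently

    $\min_{\alpha \in \mathbb{R}^p} {}^{u}|f - \alpha' h_\beta|^2 = {}^{u}|f - \alpha_\beta' h_\beta|^2 =: d(\beta)$

with

    $\alpha_\beta = {}^{u}(h_\beta, h_\beta)^{-1}\, {}^{u}(h_\beta, f).$    (31)

$\alpha_\beta$ and $d(\beta)$ are continuous in $\beta$ because of A_2 and A_3. In $\hat\vartheta_n = (\hat\alpha_n, \hat\beta_n)$ the
structure of $\hat\alpha_n$ is given by

    $\hat\alpha_n = {}^{w}(h^{(n)}_{\hat\beta_n}, h^{(n)}_{\hat\beta_n})_n^{-1}\, {}^{w}(h^{(n)}_{\hat\beta_n}, y)_n.$

Furthermore, let

    $\hat\alpha_\beta := {}^{w}(h^{(n)}_\beta, h^{(n)}_\beta)_n^{-1}\, {}^{w}(h^{(n)}_\beta, y)_n.$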


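Lemma 1.1.3 Let A_1, A_2, and A_3 be fulfilled. Then

    $\hat\alpha_n - \alpha_{\hat\beta_n} \to 0$ a.s.    (32)

holds and

    $\sup_{\beta \in \mathcal{B}} \|\hat\alpha_\beta - \alpha_\beta\| \to 0$ a.s.    (33)


Proof. From A_2 it follows that $\sup_{\beta \in \mathcal{B}} {}^{u}|h^{(n)}_\beta|_n \to \sup_{\beta \in \mathcal{B}} {}^{u}|h_\beta|$, and as in the proof of
Lemma 1.1.1 we show:

    ${}^{w}(h^{(n)}_{\hat\beta_n}, h^{(n)}_{\hat\beta_n})_n - {}^{u}(h_{\hat\beta_n}, h_{\hat\beta_n}) \to 0$ a.s.

and

    ${}^{w}(h^{(n)}_{\hat\beta_n}, \varepsilon)_n = {}^{w}(h^{(n)}_{\hat\beta_n} - h_{\hat\beta_n}, \varepsilon)_n + {}^{w}(h_{\hat\beta_n}, \varepsilon)_n \to 0$ a.s.

by exploiting Lemma 1.1.2.
With

    ${}^{w}(h^{(n)}_{\hat\beta_n}, f)_n - {}^{u}(h_{\hat\beta_n}, f) \to 0$ a.s.

(32) follows.
As in the proof of Lemma 1.1.1 we then get

    $\sup_{\beta \in \mathcal{B}} \|{}^{w}(h^{(n)}_\beta, h^{(n)}_\beta)_n - {}^{u}(h_\beta, h_\beta)\| \to 0$ a.s.

and

    $\sup_{\beta \in \mathcal{B}} \|{}^{w}(h^{(n)}_\beta, f)_n - {}^{u}(h_\beta, f)\| \to 0$ a.s.

A_1 and Lemma 1.1.2 yield

    $\sup_{\beta \in \mathcal{B}} \|{}^{w}(h^{(n)}_\beta, \varepsilon)_n\| \to 0$ a.s.,

which yields (33):

    $\sup_{\beta \in \mathcal{B}} \|{}^{w}(h^{(n)}_\beta, h^{(n)}_\beta)_n^{-1}\, {}^{w}(h^{(n)}_\beta, y)_n - {}^{u}(h_\beta, h_\beta)^{-1}\, {}^{u}(h_\beta, f)\| = \sup_{\beta \in \mathcal{B}} \|\hat\alpha_\beta - \alpha_\beta\| \to 0$ a.s.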


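Proof of Theorem 1.1.1. Because of the continuity of $d(\beta)$ there exists a $\beta^f \in \mathcal{B}$
with $d(\beta^f) = \min_{\beta \in \mathcal{B}} d(\beta)$. According to this,

    $\Delta_f = {}^{u}|f - g_{\vartheta^f}|$ with $\vartheta^f = (\alpha_{\beta^f}, \beta^f).$

On account of $\sup_{\beta \in \mathcal{B}} \|\alpha_\beta\| < \infty$ and Lemma 1.1.2, we obtain

    ${}^{w}(\alpha_{\hat\beta_n}' h_{\hat\beta_n}, \varepsilon)_n = \alpha_{\hat\beta_n}'\, {}^{w}(h_{\hat\beta_n}, \varepsilon)_n \to 0$ a.s.

With Lemma 1.1.3 this yields

    ${}^{w}(f - g_{\hat\vartheta_n}, \varepsilon)_n = {}^{w}(f, \varepsilon)_n - [\hat\alpha_n - \alpha_{\hat\beta_n}]'\, {}^{w}(h_{\hat\beta_n}, \varepsilon)_n - \alpha_{\hat\beta_n}'\, {}^{w}(h_{\hat\beta_n}, \varepsilon)_n \to 0$ a.s.    (34)

From A_1 it follows that

    ${}^{w}|\varepsilon|_n^2 \to \tau_u$ a.s.    (35)

From Lemma 1.1.1 it follows that

    ${}^{w}|f - g_{\hat\vartheta_n}|_n^2 - {}^{u}|f - g_{\hat\vartheta_n}|^2 \to 0$ a.s.    (36)

(34), (35) and (36) yield

    ${}^{w}|y - g_{\hat\vartheta_n}|_n^2 - {}^{u}|f - g_{\hat\vartheta_n}|^2 - \tau_u \to 0$ a.s.    (37)

Because of $\sup_{\beta \in \mathcal{B}} \|\alpha_\beta\| < \infty$ and Lemma 1.1.3 it almost surely holds that
$\limsup_n \|\hat\alpha_n\| < \infty$, and with A_2 it follows that

    ${}^{w}|g^{(n)}_{\hat\vartheta_n} - g_{\hat\vartheta_n}|_n^2 = \hat\alpha_n'\, {}^{w}(h^{(n)}_{\hat\beta_n} - h_{\hat\beta_n}, h^{(n)}_{\hat\beta_n} - h_{\hat\beta_n})_n\, \hat\alpha_n \to 0$ a.s.    (38)

From (37) and (38) we obtain

    $Q_n(\hat\vartheta_n) - {}^{u}|f - g_{\hat\vartheta_n}|^2 - \tau_u = {}^{w}|y - g_{\hat\vartheta_n}|_n^2 - {}^{u}|f - g_{\hat\vartheta_n}|^2 - \tau_u$
    $\qquad + {}^{w}|g_{\hat\vartheta_n} - g^{(n)}_{\hat\vartheta_n}|_n^2 + 2\,{}^{w}(y - g_{\hat\vartheta_n}, g_{\hat\vartheta_n} - g^{(n)}_{\hat\vartheta_n})_n \to 0$ a.s.    (39)

Analogously we can show that

    $Q_n(\vartheta^f) - {}^{u}|f - g_{\vartheta^f}|^2 - \tau_u \to 0$ a.s.    (40)

Together with $Q_n(\hat\vartheta_n) \le Q_n(\vartheta^f)$ a.s., (36), (39), and (40) yield assertion 1.

If A_5 is additionally fulfilled, then $\Delta_f = 0$ holds and assertion 3 is a direct
conclusion from (39) and (26).
Now we prove assertion 2. Because of Lemma 1.1.3 it suffices to show that

    $\hat\beta_n \to \beta^f$ a.s.

We put

    $d(\beta) := {}^{u}|f - \alpha_\beta' h_\beta|^2$

and

    $d_n(\beta) := {}^{w}|y - \hat\alpha_\beta' h^{(n)}_\beta|_n^2 - \tau_u.$

Analogous to the proof of (37) we can show, by applying A_3, that

    $\sup_{\beta \in \mathcal{B}} |d_n(\beta) - d(\beta)| \to 0$ a.s.    (41)

Let $\bar\beta$ be a limit point of the sequence $\hat\beta_n$. Then, because of the compactness of
$\mathcal{B}$, there exists a subsequence $\hat\beta_{n_k}$ converging to $\bar\beta$. Because of the continuity
of $d(\beta)$ and of the uniform convergence of $d_n(\beta)$ to $d(\beta)$, it follows that
$d_{n_k}(\hat\beta_{n_k}) \to d(\bar\beta)$ a.s. As $\hat\beta_{n_k}$ is WILSE, $d_{n_k}(\hat\beta_{n_k}) \le d_{n_k}(\beta^f)$, hence
$d(\bar\beta) \le d(\beta^f)$. On account of A_4, $d(\beta)$ has a unique minimum at $\beta = \beta^f$. Thus it
almost surely holds that $\bar\beta = \beta^f$. The proof of assertion 4 follows under A_5 and A_6
immediately from assertion 2.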


1.1.5.4 Further assumptions


Now we will investigate the limit distribution. For doing this we need some 
further assumptions. 


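A_7  $h^{(n)}_\beta$ and $h_\beta$ have continuous derivatives of first and second order with
respect to $\beta$. Let A_2 be true for these derivatives, too, and let A_3 hold for
the class

    $\mathcal{H}_1 := \mathcal{H} \cup \Big\{\frac{\partial h_{\beta(i)}}{\partial \beta_j},\ \frac{\partial^2 h_{\beta(i)}}{\partial \beta_j\, \partial \beta_l} \ \Big|\ i = 1, \dots, p;\ j, l = 1, \dots, m;\ \beta \in \mathcal{B}\Big\}.$

Moreover, let

    $\sqrt{n}\, \sup_{\beta \in \mathcal{B}} {}^{u}|h_{\beta(i)} - h^{(n)}_{\beta(i)}|_n \to 0, \qquad \sqrt{n}\, \sup_{\beta \in \mathcal{B}} {}^{u}\Big|\frac{\partial h_{\beta(i)}}{\partial \beta} - \frac{\partial h^{(n)}_{\beta(i)}}{\partial \beta}\Big|_n \to 0.$

A_8  (a) For all $i, j = 1, \dots, p + m$ and $k := \frac{\partial g_\vartheta}{\partial \vartheta}\Big|_{\vartheta = \vartheta^f} = (k_1, \dots, k_{p+m})'$
we have

    $c_{ij} = \lim_{n \to \infty} n^{-1} \sum_{t=1}^{n} \sigma_t^2 u_t^2\, k_i(x_t)\, k_j(x_t),$

and the matrix $C = C(u) = ((c_{ij}))$ is regular.
(b) $\vartheta^f$ is an inner point of $\Theta$.
(c) It holds that

    $n^{-1/2} \sum_{t=1}^{n} k(x_t)\, \varepsilon_t\, (u_t - w^{(n)}_t) \to 0$ a.s.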


Remark 1.1.1 With A_5 ($f = g_{\vartheta_0}$) and A_6 it holds that $\vartheta^f = \vartheta_0$, so that, in the
adequate case, A_8 is equivalent to the corresponding condition formulated
with $\vartheta_0$.
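A_9  (a) The matrix

    $G(u) := \Big(\Big(\frac{\partial^2\, {}^{u}|f - g_\vartheta|^2}{\partial \vartheta_i\, \partial \vartheta_j}\Big|_{\vartheta = \vartheta^f}\Big)\Big)_{i, j = 1, \dots, m+p}$

is regular.
(b) It almost surely holds that

    $\limsup_{n \to \infty} \sqrt{n}\, \|{}^{w}(k_{\vartheta^f}, f - g_{\vartheta^f})_n\| < \infty.$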


1.1.5.5 Asymptotic distributions 


In the following theorem we derive the asymptotic distributions of WLSE
and WILSE.


Theorem 1.1.2 Let A_1 (a) or (b), A_2, and A_3 be satisfied.

1. Let A_5, A_6, A_7 and A_8 be fulfilled and let $B^{-1}(u) := {}^{u}(k, k)$ be regular. With
the notation

    $M(u) = B(u)\, C(u)\, B(u)$

it holds that

    $\mathcal{L}\{\sqrt{n}\,(\hat\vartheta_n - \vartheta_0)\} \to N(0, M(u))$
    (limit distribution of the WLSE).

2. If A_4, A_7, A_8, and A_9 are fulfilled, then

    $\mathcal{L}\{\sqrt{n}\,(\hat\vartheta_n - \vartheta^f - 2[G(u)]^{-1}\, {}^{w}(k, f - g_{\vartheta^f})_n)\} \to N(0, 4[G(u)]^{-1} C(u) [G(u)]^{-1})$
    (limit distribution of the WILSE).


In order to prove Theorem 1.1.2 we need the following lemma: 


Lemma 1.1.4 Let $k$ be a vector function defined on $\mathcal{X} \to \mathbb{R}^{p+m}$. Under the
assumptions A_1 and A_8 (c) and provided that the limit
$c_{ij} := \lim_{n \to \infty} n^{-1} \sum_{t=1}^{n} \sigma_t^2 u_t^2\, k_i(x_t)\, k_j(x_t)$ exists and $C := ((c_{ij})) > 0$, it holds that

    $\mathcal{L}\{\sqrt{n}\; {}^{w}(k, \varepsilon)_n\} \to N(0, C).$

Proof. Under A_1 it holds, for

    $\lambda_n := \sqrt{n}\; {}^{u}(k, \varepsilon)_n,$

that

    $\mathcal{L}\{\lambda_n\} \to N(0, C),$

because of the central limit theorem in the formulation of Eicker (1966)
(compare also Bunke and Bunke, 1986, [A 4.20]). On account of A_8 (c) it further
holds that

    $\zeta_n := n^{-1/2} \sum_{t=1}^{n} k(x_t)\, \varepsilon_t\, (w^{(n)}_t - u_t) \to 0$ a.s.

As a result of the representation

    $z_n := \sqrt{n}\; {}^{w}(k, \varepsilon)_n = \lambda_n + \zeta_n$

we thus obtain

    $\mathcal{L}\{z_n\} \to N(0, C).$


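Proof of Theorem 1.1.2.
(a) Proof of assertion 1: We proceed from the derivative

    $\frac{dQ_n(\vartheta)}{d\vartheta} = -2\,{}^{w}(k^{(n)}_\vartheta, y - g^{(n)}_\vartheta)_n$
    $\qquad = 2\,{}^{w}(k^{(n)}_\vartheta, g^{(n)}_\vartheta - g_\vartheta)_n - 2\,{}^{w}(k^{(n)}_\vartheta, \varepsilon)_n + 2\,{}^{w}(k^{(n)}_\vartheta, g_\vartheta - g_{\vartheta_0})_n$

and use a Taylor expansion of second order of $g_\vartheta$ at $\vartheta = \vartheta_0$. With

    $\varphi_n = 1$ if $\hat\vartheta_n \in \operatorname{int} \Theta$, $\varphi_n = 0$ otherwise, and $\delta_n = \sqrt{n}\,(\hat\vartheta_n - \vartheta_0)$

we have

    $\frac{\sqrt{n}}{2}\, \varphi_n\, \frac{dQ_n(\vartheta)}{d\vartheta}\Big|_{\vartheta = \hat\vartheta_n} = \varphi_n [L_n \delta_n + r_n - \xi_n + \iota_n] = 0$    (42)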
with 
w Ps 2 (n) 
Le iiks ees oe 
(xs Bo is og ae Be 


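    $r_n = \sqrt{n}\; {}^{w}(k^{(n)}_{\hat\vartheta_n}, g^{(n)}_{\hat\vartheta_n} - g_{\hat\vartheta_n})_n,$
    $\xi_n = \sqrt{n}\; {}^{w}(k_{\vartheta_0}, \varepsilon)_n,$
    $\iota_n = \sqrt{n}\; {}^{w}(k_{\vartheta_0} - k^{(n)}_{\hat\vartheta_n}, \varepsilon)_n.$

Similarly to the proof of Lemma 1.1.1, it follows from A_3 and A_7 that

    $L_n \to {}^{u}(k_{\vartheta_0}, k_{\vartheta_0})$ a.s.    (43)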


From A, and #;, ~=+ 9%, we furthermore obtain that 


lal S lI(ko,, kg all Vo “lg — gol, => 0, (44) 
hp, — he, 
kg. — ko, =|. oni” , Ohe ee, 
a, Fee aaa 
OB |p=p OB |p—p, 


and from this 


a.s. 0, (45) 


Yn “(kg. — ko, ke. — ko,)n 


l€n| S Ela 


because of A,. 
Finally, from A_8 (a) and Lemma 1.1.4, it follows that

    $\mathcal{L}\{\xi_n\} \to N(0, C(u)).$    (46)

On account of A_8 (b) and Theorem 1.1.1 we have

    $\varphi_n \to 1$ a.s.    (47)

Then assertion 1 of the theorem results from (42)-(47).
Before we focus our interest on the proof of 2, we want to formulate one
more lemma.
By $Q_n'(\vartheta)$ and $\bar Q_n'(\vartheta)$ we denote the vectors of the first partial derivatives with
respect to $\vartheta$ of ${}^{w}|y - g^{(n)}_\vartheta|_n^2$ and ${}^{w}|y - g_\vartheta|_n^2$, respectively, and correspondingly,
by $Q_n''(\vartheta)$ and $Q''(\vartheta)$ the matrices of the second derivatives of ${}^{w}|y - g^{(n)}_\vartheta|_n^2$ and
${}^{u}|f - g_\vartheta|^2$.


Lemma 1.1.5 Under the assumptions in 2 of Theorem 1.1.2 we have

(a) $\sqrt{n}\, \|Q_n'(\vartheta) - \bar Q_n'(\vartheta)\| \to 0$ a.s. for all $\vartheta \in \Theta$,

(b) $\|Q_n''(\tilde\vartheta_n) - Q''(\vartheta^f)\| \to 0$ a.s. for all sequences $\{\tilde\vartheta_n\}$ for which
$\|\tilde\vartheta_n - \vartheta^f\| \le \|\hat\vartheta_n - \vartheta^f\|$ holds almost surely.


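Proof. Because of

    $Q_n'(\vartheta) = -2\,{}^{w}(k^{(n)}_\vartheta, y - g^{(n)}_\vartheta)_n$

and

    $\bar Q_n'(\vartheta) = -2\,{}^{w}(k_\vartheta, y - g_\vartheta)_n$

and A_7, the assertion (a) follows similarly as in the proof of Lemma 1.1.1:

    $\frac{\sqrt{n}}{2}\, \|Q_n'(\vartheta) - \bar Q_n'(\vartheta)\| \le {}^{w}|y - g^{(n)}_\vartheta|_n\, \sqrt{n}\, \|{}^{w}(k^{(n)}_\vartheta - k_\vartheta, k^{(n)}_\vartheta - k_\vartheta)_n\|^{1/2}$
    $\qquad + \|{}^{w}(k_\vartheta, k_\vartheta)_n\|^{1/2}\, \sqrt{n}\; {}^{w}|g^{(n)}_\vartheta - g_\vartheta|_n \to 0$ a.s.,

as we can show

    $Q_n(\vartheta) = {}^{w}|y - g^{(n)}_\vartheta|_n^2 \to {}^{u}|f - g_\vartheta|^2 + \tau_u < \infty$

analogous to the proof of (39).

Furthermore, with $K^{(n)}_\vartheta = \frac{\partial^2 g^{(n)}_\vartheta}{\partial \vartheta\, \partial \vartheta'}$, $K_\vartheta = \frac{\partial^2 g_\vartheta}{\partial \vartheta\, \partial \vartheta'}$, it holds that

    $Q_n''(\vartheta) = 2\,{}^{w}(k^{(n)}_\vartheta, k^{(n)}_\vartheta)_n - 2\,{}^{w}(K^{(n)}_\vartheta, y - g^{(n)}_\vartheta)_n$    (48)

and

    $Q''(\vartheta) = 2\,{}^{u}(k_\vartheta, k_\vartheta) - 2\,{}^{u}(K_\vartheta, f - g_\vartheta).$    (49)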


We consider the estimation 


On 


(n) kz. )y 


OKT.) — Q'S 2|"(k52, Bsn — “Cor, bor) 

+ 2l|*(eor, bor)n — “(hors kor ll + 2"(Ke, ~ Kors t — Gorn 
| 2|""(K3., gor —95.)n 
+ 2| "(Kz 8), 


+ 2ll"(Ko57, f — Jor)n — “(Kor f— gos) 


| 
On account of A_3 and A_7 the second and the fifth summand on the right-hand
side tend to zero almost surely.
According to Theorem 1.1.1, $\tilde\vartheta_n \to \vartheta^f$ a.s., and with A_7 it follows that


(n) 
(ts, — kor, ke. — keor)a|| 22> 0 : 
and 
Zee seap a egy) Cat as. 9 (50) 
5 la oe a Seo rate > 
i,j 08 ; 09; o=3,, oo, 08; o=9!|n ; 


so that the first, second and third summand tend to zero almost surely. 
Because of (50), “|€|, ——> t, and Lemma 1.1.2 (for # = H ,), we have for 
the last summand 


(Kz, ),|] < const Z,"°|eln + IP°(Kors €)all 22> 0. 


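(b) Proof of assertion 2 of Theorem 1.1.2.
Now we proceed from a Taylor expansion of first order of $Q_n'$ at $\vartheta = \vartheta^f$:

    $0 = \varphi_n Q_n'(\hat\vartheta_n) = \varphi_n Q_n'(\vartheta^f) + \varphi_n Q_n''(\tilde\vartheta_n)\,(\hat\vartheta_n - \vartheta^f).$    (51)

According to Lemma 1.1.5,

    $\iota_n := \sqrt{n}\, [Q_n'(\vartheta^f) - \bar Q_n'(\vartheta^f)] \to 0$ a.s.    (52)

and

    $Q_n''(\tilde\vartheta_n) \to Q''(\vartheta^f) = G(u)$ a.s.    (53)

With $V_n := {}^{w}(k_{\vartheta^f}, f - g_{\vartheta^f})_n$ we obtain from Lemma 1.1.4 and (52) that

    $\mathcal{L}\{\sqrt{n}\, [Q_n'(\vartheta^f) + 2V_n]\} = \mathcal{L}\{-2\sqrt{n}\; {}^{w}(k_{\vartheta^f}, \varepsilon)_n + \iota_n\} \to N(0, 4C).$    (54)

From (47), (51), (53) and (54) it follows that

    $\mathcal{L}\{\sqrt{n}\,(\hat\vartheta_n - \vartheta^f - 2[G(u)]^{-1} V_n)\} \to N(0, 4[G(u)]^{-1} C(u) [G(u)]^{-1}).$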


Remark 1.1.2 The model error produces the correction term $2[G(u)]^{-1} V_n$,
i.e. $\hat\vartheta_n$ is consistent, but for $\sqrt{n}\,(\hat\vartheta_n - \vartheta^f)$ we cannot generally assume the
validity of a limit distribution with the expected value zero. The correction term
does not occur if $\vartheta^f$ is replaced by $\vartheta^f_n := \arg\min_{\vartheta} {}^{w}|f - g_\vartheta|_n$ (compare Bunke,
1981).


1.1.5.6 Special cases and related results 
Remark 1.1.3 Now we consider the special case of the homoscedastic ($\sigma_t^2 = \sigma^2$)
adequate model ($f = g_{\vartheta_0}$) and of the nonweighted LSE. In this case the weight
function is $w_t = 1$ and we omit the labelling $w$ and $u$, respectively, on the
scalar product (${}^{w}(\cdot, \cdot)_n = (\cdot, \cdot)_n$, ${}^{u}(\cdot, \cdot) = (\cdot, \cdot)$):

    $\hat\vartheta_n = \arg\min_{\vartheta \in \Theta} Q_n(\vartheta), \qquad Q_n(\vartheta) = |y - g_\vartheta|_n^2.$

Let the assumptions of Theorems 1.1.1 and 1.1.2 be satisfied. Then

    $Q_n(\hat\vartheta_n) \to \sigma^2$ a.s.

and

    $\mathcal{L}\{\sqrt{n}\,(\hat\vartheta_n - \vartheta_0)\} \to N(0, \sigma^2 (k, k)^{-1}), \qquad k = \frac{\partial g_\vartheta}{\partial \vartheta}\Big|_{\vartheta = \vartheta_0}.$

This result corresponds to theorem 7 in Jennrich (1969).
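To make the limiting covariance $\sigma^2 (k, k)^{-1}$ of Remark 1.1.3 concrete, the sketch below evaluates its finite-sample analogue for the exponential model $\alpha e^{\beta x}$, whose gradient with respect to $(\alpha, \beta)$ is $(e^{\beta x}, \alpha x e^{\beta x})$. The function name `asymptotic_cov` and the numerical values are assumptions made only for this illustration; in practice the unknown parameters and $\sigma^2$ would be replaced by estimates.

```python
import math

def asymptotic_cov(alpha, beta, xs, sigma2):
    """sigma^2 * (k, k)_n^{-1} for g(x) = alpha*exp(beta*x), k = dg/d(alpha, beta)."""
    n = len(xs)
    k = [(math.exp(beta * x), alpha * x * math.exp(beta * x)) for x in xs]
    a = sum(u * u for u, _ in k) / n      # entry (1,1) of (k, k)_n
    b = sum(u * v for u, v in k) / n      # entry (1,2) of (k, k)_n
    d = sum(v * v for _, v in k) / n      # entry (2,2) of (k, k)_n
    det = a * d - b * b
    # Explicit inverse of the symmetric 2x2 matrix [[a, b], [b, d]], scaled by sigma^2.
    return [[sigma2 * d / det, -sigma2 * b / det],
            [-sigma2 * b / det, sigma2 * a / det]]

xs = [-1.0, 0.0, 1.0]
cov = asymptotic_cov(1.0, 0.0, xs, 4.0)
# Approximate standard errors of the LSE for sample size n = len(xs).
std_errors = [math.sqrt(cov[i][i] / len(xs)) for i in range(2)]
```

At $\alpha = 1$, $\beta = 0$ the gradient reduces to $(1, x)$, so the matrix $(k, k)_n$ is diagonal for this symmetric design and the inverse can be checked by hand.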


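Remark 1.1.4 If $\varepsilon_t \sim N(0, \sigma_t^2)$ is valid, then the maximum likelihood
estimator and the GLSE coincide and we have

    $\mathcal{L}\{\sqrt{n}\,(\hat\vartheta_n - \vartheta_0)\} \to N(0, {}^{w_0}(k, k)^{-1})$

with $w_0 = \{\sigma_t^{-2}\}$.
Thus, the covariance matrix of the asymptotic distribution has the form

    $\Big(\lim_{n \to \infty} n^{-1} \sum_{t=1}^{n} \sigma_t^{-2}\, \frac{\partial g_\vartheta(x_t)}{\partial \vartheta}\, \frac{\partial g_\vartheta(x_t)}{\partial \vartheta'}\Big|_{\vartheta = \vartheta_0}\Big)^{-1}.$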


Remark 1.1.5 It should be mentioned that the assumption AZ is rather 
restrictive. For instance, it rules out the special nonlinear regression function 


g(x;, 3) = (¢+ 8)* fora 21/2: 


i ie awa) eg COs) 
(ky k) = (tim ‘2 3 ae 


n—>0o oo 


Wu (1981) gave sufficient conditions for the strong consistency as well as for
the asymptotic normality of the LSE which admit growth rates other than $n$
for the divergence of

    $D_n(\vartheta, \bar\vartheta) = \sum_{t=1}^{n} (g(x_t, \vartheta) - g(x_t, \bar\vartheta))^2$

to infinity. For details the reader is referred to that paper.


Remark 1.1.6 We have assumed that the nonlinear part $\beta$ of the regression
coefficients has a compact range $\mathcal{B}$ in order to assure strong consistency.
Recently, Lauter (1989) proved that the compactness assumption can be
replaced by the following conditions:

For every $\vartheta_0$ there are constants $c(\vartheta_0) > 0$ and $d(\vartheta_0) > 0$ such that for all $\vartheta$
with $\|\vartheta\| > d(\vartheta_0)$ and for all $n$,

    $\sum_{t=1}^{n} (g(x_t, \vartheta) - g(x_t, \vartheta_0))^2 \ge c(\vartheta_0).$

Furthermore, the set of all parameters $\vartheta$ fulfilling
$\|\vartheta\| \le d(\vartheta_0)$ is compact.


Remark 1.1.7 As is well known, the LSE is very sensitive with respect
to 'outliers'. More robust estimators were constructed by Huber (1964) within
the concept of the M-estimators. For the nonlinear regression model Grossmann
(1976) introduced, following these ideas, estimators which are the solution
of the following minimization problems:

    $\hat\vartheta^k: \quad \min_{\vartheta \in \Theta} \sum_{t=1}^{n} \rho_k(y_t - g_\vartheta(x_t)),$

where

    $\rho_k(z) = z^2/2$ for $|z| < k$, and $\rho_k(z) = k|z| - k^2/2$ for $|z| \ge k$.

For these estimators the asymptotic normality and the existence of a $k^*$ for
which $\hat\vartheta^{k^*}$ in $\{\hat\vartheta^k \mid k_1 \sigma \le k \le k_2 \sigma\}$ has the smallest asymptotic covariance matrix,
is proved under certain assumptions. It is suggested to estimate this covariance 
matrix, which depends on the &;, and to proceed in two stages. 
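The effect of replacing the square by a function that grows only linearly in the tails can be seen in a small experiment. The sketch assumes the usual Huber function with tuning constant $k$ and, purely for illustration, a location model in which the regression function is the constant $\vartheta$; the grid search and the names `rho_k`, `m_objective` and `location` are hypothetical choices, not taken from the cited papers.

```python
def rho_k(z, k):
    """Huber's function: quadratic near zero, linear in the tails."""
    a = abs(z)
    return 0.5 * z * z if a < k else k * a - 0.5 * k * k

def m_objective(theta, g, xs, ys, k):
    """Criterion sum_t rho_k(y_t - g(x_t, theta))."""
    return sum(rho_k(y - g(x, theta), k) for x, y in zip(xs, ys))

def location(x, theta):
    """Hypothetical toy model: the regression function is the constant theta."""
    return theta

xs = [0.0] * 5
ys = [1.0, 1.1, 0.9, 1.0, 50.0]              # the last observation is an outlier
grid = [j / 100 for j in range(0, 6001)]     # candidate values 0.00, 0.01, ..., 60.00
theta_huber = min(grid, key=lambda th: m_objective(th, location, xs, ys, 1.0))
theta_ls = sum(ys) / len(ys)                 # the LSE in the location model is the mean
```

The least squares estimate is pulled far towards the outlier, while the M-estimate stays close to the bulk of the data; the outlier contributes only a bounded amount to the estimating equation.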

So far we have provided some contributions to what is called first-order
asymptotic theory for the nonlinear regression model. It is well known that
first-order asymptotically efficient procedures can be poor for moderate sample
sizes. Usually this is demonstrated for some special cases including numerical
comparisons or simulations. In fact only the linear term of the Taylor
expansion of the nonlinear regression function yields a contribution to the limit
distribution of various test statistics. Thus, as long as first-order asymptotics
are considered, one gets the same results one would get for the artificial
linearized model, linearized at the true parameter, i.e.

    $y_t = g(x_t, \vartheta_0) + (\vartheta - \vartheta_0)'\, \frac{\partial}{\partial \vartheta} g(x_t, \vartheta_0) + \varepsilon_t,$

neglecting the dependence of $\frac{\partial}{\partial \vartheta} g(x_t, \vartheta)$ on $\vartheta$. Therefore, almost no
characteristics of the curvature of the regression function influence the statistical
inference based on limit distributions (see Bates and Watts, 1980; Hamilton,
Watts, and Bates, 1982; Amari, 1982). On the other hand, first-order asymptotic
statistical procedures admit an approximation accuracy of $O(n^{-1/2})$ at most,
which can be too poor for applications. Therefore, one would like to derive
statistical procedures which hopefully can also be applied for moderate sample
sizes with a sufficient accuracy. Following the Pfanzagl school (Pfanzagl,
1973a, b; Michel, 1975; see also Chibisov, 1972, 1973a, b; or Akahira and
Takeuchi, 1976), the idea is to improve first-order efficient procedures using
Edgeworth expansions for the statistics involved. In Schmidt and Zwanzig
(1985) stochastic expansions as well as Edgeworth expansions of second order
for the LSE and the residual sum of squares are used to construct sequences
of tests, confidence regions and estimators which possess an approximation
accuracy of $o(n^{-1/2})$ and which are second-order asymptotically efficient if the
observations are normally distributed. For instance, an estimator
$\tilde\vartheta_i = \tilde\vartheta_i(y_1, \dots, y_n)$ for $\vartheta_i$ is constructed which possesses the following properties.
Let us denote by $P_\xi$ the distribution of $Y_n = (y_1, \dots, y_n)'$, $\xi = (\vartheta, \sigma^2, \varkappa)$,
where $\varkappa$ is the distribution of $\sigma^{-1} \varepsilon_t$.

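(a) $\tilde\vartheta_i$ is asymptotically median unbiased of order $o(n^{-1/2})$:

    $P_\xi(\tilde\vartheta_i \ge \vartheta_i) \ge \frac{1}{2} - o(n^{-1/2})$

and

    $P_\xi(\tilde\vartheta_i \le \vartheta_i) \ge \frac{1}{2} - o(n^{-1/2}),$

both inequalities holding uniformly in $\vartheta$ from compact subsets
of the parameter space.

(b) $\tilde\vartheta_i$ possesses greatest coverage probabilities for shrinking intervals in the
class of all asymptotically median unbiased estimators of order $o(n^{-1/2})$:

    $P_\xi\{\vartheta_i - n^{-1/2} h' \le \tilde\vartheta_i \le \vartheta_i + n^{-1/2} h''\}$
    $\qquad \ge P_\xi\{\vartheta_i - n^{-1/2} h' \le \vartheta_i^* \le \vartheta_i + n^{-1/2} h''\} + o(n^{-1/2})$

uniformly on compact subsets of the parameter space for all estimators
$\vartheta_i^* = \vartheta_i^*(y_1, \dots, y_n)$ which are asymptotically median unbiased of order
$o(n^{-1/2})$.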


Here the estimator $\tilde\vartheta_i$ has the following structure:

    $\tilde\vartheta_i = \hat\vartheta_i + \frac{1}{n}\, K(\hat\zeta),$

where $K(\zeta)$ is a bounded function of $\zeta$ on compact subsets of the parameter
range and $K(\zeta)$ is bounded in $n$. Moreover, $\hat\zeta = (\hat\vartheta, \hat\sigma)$ is composed of the
LSE and the residual sum of squares, and $\hat\vartheta_i$ denotes the LSE for $\vartheta_i$.

Notice that $K(\zeta)$ involves second-order partial derivatives of $g(x, \vartheta)$ with
respect to $\vartheta$. For details the interested reader is referred to the paper mentioned
above.
1.1.6 Asymptotic optimality


In this section we will establish two different asymptotic optimality properties
of the LSE. It will be shown that the GLSE, and thus the two-stage estimator
$\hat\vartheta_n$, is the asymptotically best WLSE in the adequate model, by comparing
the corresponding limiting covariance matrices. Furthermore, assuming a
normal distribution for the observations, asymptotic optimality of the GLSE
holds in the larger class of all asymptotically normally distributed estimators.
There the normal distribution is necessary for asymptotic optimality. Then we
will investigate the question under which condition the maximum likelihood
estimator is asymptotically optimal under an arbitrary distribution assumption.
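In Theorem 1.1.2, under the assumptions $A_1, A_2, A_3, A_5, A_6, A_7, A_8$,

$$B(u) = \langle k_{\vartheta_0}, k_{\vartheta_0} \rangle_u^{-1} \in M_{m+p}$$

and

$$C(u) = \Big( \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} u_t^2 \sigma_t^2\, k_{\vartheta_0,i}(x_t)\, k_{\vartheta_0,j}(x_t) \Big)_{i,j} \in M_{m+p},$$

the convergence

$$\mathcal{L}\big(\sqrt{n}\,(\hat\vartheta_n - \vartheta_0)\big) \to N(0, M(u))$$

with $M(u) = B(u)\, C(u)\, B(u)$ was proved for the WLSE $\hat\vartheta_n$ with weighting sequence $u$. If, analogous to the notation in the linear model, we introduce the matrices

$$X_n = \big(k_{\vartheta_0}(x_1), \ldots, k_{\vartheta_0}(x_n)\big)', \qquad \Sigma_n = \mathrm{Diag}[\sigma_1^2, \ldots, \sigma_n^2]$$

and $U_n = \mathrm{Diag}[u_1, \ldots, u_n]$, it obviously holds that

$$M(u) = \lim_{n\to\infty} n\, (X_n' U_n X_n)^{-1} X_n' U_n \Sigma_n U_n X_n (X_n' U_n X_n)^{-1} = \lim_{n\to\infty} n\, L_n \Sigma_n L_n' \qquad (56)$$

with $L_n = (X_n' U_n X_n)^{-1} X_n' U_n$.
From the Gauss-Markov theorem (cf. Bunke and Bunke, 1986, theorem 2.1.1), for any matrix $L_n$ with $L_n X_n = I$,

$$L_n \Sigma_n L_n' \ge (X_n' \Sigma_n^{-1} X_n)^{-1}$$

and thus

$$M(u) \ge \lim_{n\to\infty} n\, (X_n' \Sigma_n^{-1} X_n)^{-1} = \langle k_{\vartheta_0}, k_{\vartheta_0} \rangle_{w_\sigma}^{-1} = M(w_\sigma).$$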
That means the following theorem is true.

Theorem 1.1.3 Let $W$ be the set of all weighting sequences $w$ for which the assumptions of Theorem 1.1.2, part 1, are fulfilled. Then $M(w_\sigma) \le M(w)$ for all $w \in W$ if $w_\sigma \in W$. (We write $A \le B$ for two positive semidefinite matrices $A$ and $B$ iff $B - A$ is positive semidefinite.)
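
The ordering in Theorem 1.1.3 can be checked numerically in the simplest situation. The sketch below assumes a scalar parameter, so that the finite-n analogue of $M(u)$ in (56) reduces to the number $n \sum u_t^2 \sigma_t^2 k_t^2 / (\sum u_t k_t^2)^2$. The function name `sandwich_variance`, the gradient values `k` and the variances `sigma2` are illustrative assumptions, not quantities taken from the text.

```python
import random

def sandwich_variance(k, sigma2, u):
    """Finite-n analogue of M(u) for a scalar parameter:
    n * sum(u^2 sigma^2 k^2) / (sum(u k^2))^2."""
    n = len(k)
    bread = sum(ut * kt * kt for ut, kt in zip(u, k))
    meat = sum(ut * ut * st * kt * kt for ut, st, kt in zip(u, sigma2, k))
    return n * meat / (bread * bread)

random.seed(1)
n = 50
k = [random.uniform(0.5, 2.0) for _ in range(n)]        # hypothetical gradients k(x_t)
sigma2 = [random.uniform(0.2, 3.0) for _ in range(n)]   # hypothetical error variances

w_sigma = [1.0 / s for s in sigma2]                     # GLS weights 1 / sigma_t^2
m_opt = sandwich_variance(k, sigma2, w_sigma)
m_ols = sandwich_variance(k, sigma2, [1.0] * n)         # OLS weights w_1 = {1}
m_rand = sandwich_variance(k, sigma2, [random.uniform(0.1, 5.0) for _ in range(n)])

print(m_opt, m_ols, m_rand)
```

In the scalar case the inequality is just the Cauchy-Schwarz inequality, so the value for the GLS weights is never larger than the value for any other positive weighting sequence.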

The optimal weights $1/\sigma_t^2$ are unknown in general and thus the GLSE
cannot be computed. But the two-stage estimator has the same limiting distribution
as the GLSE and is asymptotically best WLSE, too. As in the linear
model one can ask here under which conditions the OLSE

$$\hat\vartheta_n = \hat\vartheta_n^{w_1} \quad \text{with } w_1 = \{1\},$$

which can in general be computed more easily, is asymptotically optimal.
Because of (56) and the results of Kruskal (1968) (see Bunke and Bunke, 1986)
this is fulfilled for instance if for all sufficiently large $n$,

$$\mathcal{R}(\Sigma_n X_n) \subseteq \mathcal{R}(X_n) \qquad (57)$$

holds. ($\mathcal{R}(X_n)$ is the linear subspace in $\mathbb{R}^n$ generated by the columns of $X_n$.)


Example 1.1.3 We consider an exponential regression model with repeated 
observations in two different points x), %2) € IR?: 


Y, = axe + &,, He? = Oy ECM =a, 


with 4h —- (a a a Ny}, Is — {ny s- itr seeg Ny, — No} 
nt Oe, by Be 
n 


n—>0o 


Then all assumptions from Theorem 1.1.3 as well as (57) are fulfilled and the
OLSE is asymptotically optimal if $\alpha \in \mathbb{R}^1 \setminus \{0\}$ and if $\beta$ varies in a compact
interval in $\mathbb{R}^1$. More generally, the OLSE is asymptotically optimal in
repetition models


Yt = Jo(Xj)) + &, Ee; = Oj) » Core 


with Y Pate PaaS |Z’; | =N;, pa Mig = {Dares 5570} 


and He 381 4= 1.5m + p 
n 
if A, holds. 


In the following we define what we mean by a best asymptotically normally
distributed (BAN) estimator. For this purpose we need a parametric description
of the distribution family for the observation vector $Y_n = (y_1, \ldots, y_n)'$.
Let $\kappa_t$ be the distribution of $\varepsilon_t$ and let $\kappa = (\kappa_1, \kappa_2, \ldots)$.
Let $\mathcal{K}$ denote the set of all such sequences $\kappa$; then the parameter $\zeta = (\vartheta, \kappa)$
characterizes the distribution of the sequence $(y_1, y_2, \ldots)$. In the following let
$P_{n\zeta}$ be the distribution of the vector $Y_n$ induced by that distribution.


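Definition 1.1.3 A sequence $\{\hat\vartheta_n\}$ of estimators,

$$\hat\vartheta_n = \hat\vartheta_n(Y_n) \quad \text{with} \quad \mathcal{L}\big(\sqrt{n}\,(\hat\vartheta_n - \vartheta) \mid \zeta\big) \to N(0, V(\zeta)),$$

is said to be a best asymptotically normally distributed (BAN) sequence of estimators for $\vartheta$ if for each other estimation sequence $\{\tilde\vartheta_n\}$ with $\mathcal{L}\big(\sqrt{n}\,(\tilde\vartheta_n - \vartheta) \mid \zeta\big) \to N(0, S(\zeta))$ it holds that

$$V(\zeta) \le S(\zeta)$$

for all $\vartheta \in \Theta \setminus N$ and for all $\kappa \in \mathcal{K}$ with $\zeta = (\vartheta, \kappa)$. Here $N \subset \Theta$ is a Lebesgue-zero set.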


Next, by using general results by Bahadur (1964, 1967), we derive a lower 
bound for the limiting covariance matrix of an arbitrary asymptotically 
normally distributed estimator. To do this we need some regularity assump- 
tions: 


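$A_{10}$ The distributions $\kappa_t$ of $\varepsilon_t$ are absolutely continuous with respect to a
$\sigma$-finite measure $\mu$. The logarithm $l_t(\varepsilon_t)$ of a positive variant of the
$\mu$-density of $\kappa_t$ is twice continuously differentiable,

$$l_t^{(1)}(e) = \frac{d}{de}\, l_t(e), \qquad l_t^{(2)}(e) = \frac{d^2}{de^2}\, l_t(e),$$

with

(a) $E\{l_t^{(1)}(\varepsilon_t)\} = 0$, $E\{l_t^{(2)}(\varepsilon_t)\} = -D\{l_t^{(1)}(\varepsilon_t)\} > -\infty$,
$D\{l_t^{(2)}(\varepsilon_t)\} < \infty$, $t = 1, 2, \ldots$;

(b) $l_t^{(2)}$ is uniformly continuous, uniformly with respect to $t = 1, 2, \ldots$;

(c) for each $t$ there exists a function $R_t$ and a $\delta_t > 0$ with $E R_t(\varepsilon_t) < \infty$
and $|l_t^{(2)}(\varepsilon_t + h)| \le R_t(\varepsilon_t)$ for $|h| \le \delta_t$;

(d) $\frac{1}{n} \sum_{t=1}^{n} s_t$ and $\frac{1}{n} \sum_{t=1}^{n} D\{l_t^{(2)}(\varepsilon_t)\}$ with $s_t = D\{l_t^{(1)}(\varepsilon_t)\}$ are asymptotically
bounded.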

Moreover we demand: 


Ai (a) g(a) is uniformly continuous with respect to t = Pe oe. sce OS 


+0. 


os max |l|ko(x;)l? 
N 1<t<n 


noo 


(b) $\lambda_{\max}[K_\vartheta(x_t)] \le c < \infty$ with a constant $c$ not depending on $t$.


Lemma 1.1.6 Under the assumptions A,, As5, Ayo, Ay, and Wz, Wo, 8 € W with
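$s = \{s_t\}$, it holds for each $h \in \mathbb{R}^{p+m}$ and for each inner point $\vartheta$ of $\Theta$ that

$$\mathcal{L}\big(L_n(Y_n, \zeta_n) - L_n(Y_n, \zeta) \mid \zeta\big) \to N\big(-\tfrac12\, h' I(\zeta)\, h,\; h' I(\zeta)\, h\big)$$

with $\zeta_n = (\vartheta + n^{-1/2} h, \kappa)$ and $I(\zeta) = \langle k_\vartheta, k_\vartheta \rangle_s$, and thus the sequence $\{P_{n\zeta_n}\}$ is
contiguous to the sequence $\{P_{n\zeta}\}$ according to result [A 2.2], where $L_n(Y_n, \zeta)$
denotes the logarithmic density of $Y_n$.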


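Proof. With $\vartheta_n = \vartheta + n^{-1/2} h$, we have:

$$\Lambda_n := L_n(Y_n, \zeta_n) - L_n(Y_n, \zeta) = \sum_{t=1}^{n} \big[ l_t(y_t - g_{\vartheta_n}(x_t)) - l_t(y_t - g_\vartheta(x_t)) \big] = \sum_{t=1}^{n} \big[ l_t(\varepsilon_t + g_\vartheta(x_t) - g_{\vartheta_n}(x_t)) - l_t(\varepsilon_t) \big].$$

For $l_t(\varepsilon_t + a_{tn})$ with $a_{tn} := g_\vartheta(x_t) - g_{\vartheta_n}(x_t)$ we use a Taylor expansion up to
the second term around $\varepsilon_t$ and obtain

$$\Lambda_n = \sum_{t=1}^{n} a_{tn}\, l_t^{(1)}(\varepsilon_t) + \frac12 \sum_{t=1}^{n} a_{tn}^2\, l_t^{(2)}(\varepsilon_t + \lambda_t a_{tn}) =: A_n + B_n$$

with $\lambda_t = \lambda_t(\varepsilon_t)$ and $|\lambda_t| < 1$.
A further Taylor expansion of $g_{\vartheta_n}(x_t)$ around $g_\vartheta(x_t)$ yields the existence of
real numbers $\delta_t$ with $|\delta_t| < 1$ and

$$A_n = -\frac{1}{\sqrt{n}} \sum_{t=1}^{n} h' k_\vartheta(x_t)\, l_t^{(1)}(\varepsilon_t) - \frac{1}{2n} \sum_{t=1}^{n} h' K_{\vartheta + \delta_t n^{-1/2} h}(x_t)\, h\; l_t^{(1)}(\varepsilon_t) =: C_n + D_n.$$

Due to results from Bunke and Bunke (1986, theorem 2.4.3 and lemma 2.4.2)
it holds that

$$\mathcal{L}(C_n \mid \zeta) \to N(0, h' I(\zeta)\, h), \qquad (58)$$

since the conditions $E\{l_t^{(1)}(\varepsilon_t)\} = 0$, $D_\zeta C_n \to h' I(\zeta)\, h$ and

$$\max_{1 \le t \le n} \frac{1}{n}\, h' k_\vartheta(x_t)\, k_\vartheta(x_t)'\, h \to 0 \quad (n \to \infty)$$

are satisfied. Furthermore, $D_n \to 0$ in $P_{n\zeta}$ probability results from $E_\zeta D_n = 0$ and

$$D_\zeta(D_n) = \frac{1}{4n^2} \sum_{t=1}^{n} \big( h' K_{\vartheta + \delta_t n^{-1/2} h}(x_t)\, h \big)^2 s_t \le \frac{c^2 \|h\|^4}{4n} \cdot \frac{1}{n} \sum_{t=1}^{n} s_t \to 0.$$

Next we show

$$B_n \to -\tfrac12\, h' I(\zeta)\, h \quad \text{in } P_{n\zeta} \text{ probability},$$

which proves the lemma. With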


1 
by, >= =— A'keg(x;) heo(a4)' h 
2n 


n 


iW nal 5h Keo (X1) bh’ K 5.5 5,n-a/2n(1) 


1 ! 
+ Qn? (h K 54. 5n-/n(Xt) h)? 


we have 


We show G,, — at | es h'l(C) h and F, Pay According to the definition 


of b,, it holds with 


1 n 
Gi, = Sit uh Tes (2,) Ko(a,)’ AUP (&,), 


Gen = D (2'K 54 5yn-210n(@1) 2)? Uy? (Et) 
tes 
and 
Gin = 3 h’K 5.5. 5n-2/2n(%1) he’ keg (ae) UY? (€1) 
nN t=1 
that 


G, ae Gin Gon Gign O 
1 ; 
G,, converges in P,; probability to xr h'I(¢) h since 


1 1 
1 are h' %(ko, ko)n kh — > —— WI (C)h 


n—>co 9 




and 
DG =Te ys oa II {AI [? D-{1?(e,)) 
< see max [ho [ale — a Delle er 0 
n 
hold. From 
1 n 
E:Go, = —— » (2'K 94 64n-2/2n(1) h)? S&S ale ae = 2 So 0 
2n? t=1 2n 
and 
1 n 
DGizn Sarr na (h' K54.5,n-12n(X1) h)4 Del (e1)} 
4n* 124 


< + jn + -> De{l(e,)} ——> 0 
=1 


P, 
Sat) 


it follows that G2, eS 0, and analogously it can be shown that Gz, 
It remains to prove that 
nm 
= 2h fin(lt” (&¢ + Aydin) — UP (e )) 
with 
i; pha ERC ja : (h'K (a4) h)? 
ities on Bt) MO\t On? 9+ en PAR t 


1 
+ Tale h’kes (1) WK 5 + 54n-10n(%t) A 


converges to zero in P,; probability. 

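Because of $A_{11}$ and $|\lambda_t| < 1$, $|\lambda_t a_{tn}|$ is, uniformly in $t$, smaller than any positive
number $\eta$, no matter how small $\eta$ is, provided $n$ is sufficiently large. Taking
into account $A_{10}$ (b), there is an $n_0$ for each $\eta > 0$ so that for all $n > n_0$

$$\sup_{1 \le t \le n} E_\zeta \big| l_t^{(2)}(\varepsilon_t + \lambda_t a_{tn}) - l_t^{(2)}(\varepsilon_t) \big| < \eta$$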
is fulfilled. The inequality 
LF, S sup Bs |UP(&, + Adin) — YP (€:)| Dd Ifenl 
t=1 


1StSn 
and 
e 1 
Weal 5 —S eA 
implies 
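$E_\zeta |F_n| \to 0$ and thus $F_n \to 0$ in $P_{n\zeta}$ probability.

In case $\kappa = \kappa^0$ is arbitrary but fixed, $Y_n$ has a distribution in the family $\{P_{n\zeta} \mid \zeta = (\vartheta, \kappa^0) \in \Theta \times \{\kappa^0\}\}$ with the finite-dimensional parameter $\vartheta \in \Theta$. By Lemma 1.1.4 and [A 2.8] we have for each estimation sequence $\{\tilde\vartheta_n\}$ with

$$\mathcal{L}\big(\sqrt{n}\,(\tilde\vartheta_n - \vartheta) \mid (\vartheta, \kappa^0)\big) \to N(0, S(\vartheta, \kappa^0)) \quad \text{for all } \vartheta \in \Theta$$

that

$$S(\vartheta, \kappa^0) \ge I^{-1}(\vartheta, \kappa^0) \quad \text{for almost all } \vartheta.$$

If, moreover, $\{\tilde\vartheta_n\}$ is an estimation sequence with

$$\mathcal{L}\big(\sqrt{n}\,(\tilde\vartheta_n - \vartheta) \mid (\vartheta, \kappa)\big) \to N(0, S(\vartheta, \kappa)) \quad \text{for all } \zeta = (\vartheta, \kappa) \in \Theta \times \mathcal{K},$$

then this yields

$$S(\vartheta, \kappa) \ge I^{-1}(\vartheta, \kappa) \quad \text{for almost all } \vartheta \text{ and for all } \kappa.$$

Therefore we have proved the following theorem: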


Theorem 1.1.4 Under the assumptions of Lemma 1.1.6, $I^{-1}(\zeta) = \langle k_\vartheta, k_\vartheta \rangle_s^{-1}$
is, for almost all $\vartheta$, a lower bound for the limiting covariance matrix of an
arbitrary asymptotically normally distributed estimation sequence.


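Remark 1.1.8 If $\{\tilde\vartheta_n\}$ is an estimation sequence with $\mathcal{L}\big(\sqrt{n}\,(\tilde\vartheta_n - \vartheta) \mid \zeta\big) \to N(0, S(\vartheta, \kappa))$ and if, for each fixed $\kappa$, $S(\vartheta, \kappa)$ and $I(\vartheta, \kappa)$ are continuous in $\vartheta$, then $S(\zeta) \ge I^{-1}(\zeta)$ for all $\zeta$. Namely, if $N \subset \Theta$ is a Lebesgue-zero set, then to each $\vartheta \in N$ there exists a sequence $\{\vartheta_m\}$ of parameters $\vartheta_m \in \Theta \setminus N$ with $\vartheta_m \to \vartheta$ as $m \to \infty$, and we obtain

$$S(\zeta) = S(\vartheta, \kappa) = \lim_{m\to\infty} S(\vartheta_m, \kappa) \ge \lim_{m\to\infty} I^{-1}(\vartheta_m, \kappa) = I^{-1}(\vartheta, \kappa) = I^{-1}(\zeta).$$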


Now, from Theorem 1.1.4, the BAN property of the GLSE follows easily under the
assumption of a normal distribution (and consequently that of the two-stage
estimator and that of the OLSE under (57)).


Theorem 1.1.5 Under the assumptions of Theorem 1.1.2, part 1, and of Lemma
1.1.6, and $\varepsilon_t \sim N(0, \sigma_t^2)$, the GLSE is BAN. In case the $\varepsilon_t$ are identically
distributed, it holds with $\zeta_0 = (\vartheta_0, \kappa)$ that $M(w_\sigma) = I^{-1}(\zeta_0)$ iff $\varepsilon_t \sim N(0, \sigma^2)$.


Proof. Because of Theorem 1.1.2, $M(w_\sigma) = \langle k_{\vartheta_0}, k_{\vartheta_0} \rangle_{w_\sigma}^{-1}$ is valid and under $\varepsilon_t \sim N(0, \sigma_t^2)$ we have $w_\sigma = \{1/\sigma_t^2\} = \{s_t\} = s$. In the identically distributed case,

$$M(w_\sigma) = \sigma^2 \langle k_{\vartheta_0}, k_{\vartheta_0} \rangle^{-1} = \frac{1}{D\{l^{(1)}(\varepsilon_t)\}}\, \langle k_{\vartheta_0}, k_{\vartheta_0} \rangle^{-1} = I^{-1}(\zeta_0)$$

implies $D\{l^{(1)}(\varepsilon_t)\} = \sigma^{-2}$, which gives with [A 3.13] $\varepsilon_t \sim N(0, \sigma^2)$.

Thus we have shown that the GLSE is a best asymptotically normally distributed
estimator under normal distribution. But in this case it is the maximum
likelihood estimator, too, and it can be expected that the MLE, computed under
an arbitrary distribution assumption, is always BAN. This will be proved
in the following. For this purpose we first show the strong consistency of the
MLE in Theorem 1.1.6 and derive its limiting distribution in Theorem 1.1.7,
where we restrict ourselves to the case of the adequate nonlinear model

$$y_t = g_{\vartheta_0}(x_t) + \varepsilon_t, \quad t = 1, 2, \ldots,$$

with $\vartheta_0 \in \Theta \subset \mathbb{R}^p$ and compact $\Theta$.


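Definition 1.1.4 Each measurable solution $\hat\vartheta_n$ of

$$L_n(Y_n, \hat\vartheta_n) = \sup_{\vartheta \in \Theta} L_n(Y_n, \vartheta)$$

with $L_n(Y_n, \vartheta) = \sum_{t=1}^{n} l_t\big(y_t - g_\vartheta(x_t)\big)$ is called a maximum likelihood estimator
(MLE) for $\vartheta$. As, by assumption, $g_\vartheta$ is continuous in $\vartheta$ and $\Theta$ is a compact subset
of $\mathbb{R}^p$, there exists at least one MLE according to [A 3.1].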
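
Definition 1.1.4 can be illustrated by a deliberately naive sketch that maximises $L_n(Y_n, \vartheta)$ over a finite grid in a compact parameter interval. The function names `grid_mle` and `normal_log_density`, the grid and the data are hypothetical choices made for illustration; a grid search is not the method of the text, only a way to make the definition concrete for a scalar parameter.

```python
def normal_log_density(e):
    """Log-density of N(0, 1) up to an additive constant."""
    return -0.5 * e * e

def grid_mle(data, g, log_density, grid):
    """Maximise L_n(theta) = sum_t l(y_t - g(theta, x_t)) over a finite grid.

    data is a list of (x_t, y_t) pairs, grid a finite subset of a compact Theta.
    """
    best_theta, best_value = None, float("-inf")
    for theta in grid:
        value = sum(log_density(y - g(theta, x)) for x, y in data)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta

# Location model y_t = theta + eps_t: under normal errors the MLE is the sample mean.
data = [(None, 0.9), (None, 1.4), (None, 0.7), (None, 1.2), (None, 0.8)]
grid = [i / 1000 for i in range(0, 2001)]               # Theta = [0, 2]
theta_hat = grid_mle(data, lambda theta, x: theta, normal_log_density, grid)
print(theta_hat)
```

Replacing `normal_log_density` by another log-density changes the criterion but not the search, which is why the compactness of $\Theta$ and the continuity of $g_\vartheta$ are enough for existence.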


Theorem 1.1.6 Let the following assumptions hold: A,, Az, As, Ag, Ayw(a), 
Ajo (d); 


S,Wy WW; 
go(x,) ts bounded uniformly in & and ¢; 


1 


co 
2 
hod 22 


we 


BUT; (&:) I < 00; 


_ 


for each sequence $\{d_t\}$ of real numbers $d_t$ there exists a sequence of measurable
functions $R_t$ and a positive number $c$ with

$$l_t^{(2)}(z + d_t) \le R_t(z) \ \text{ for all } z \in \mathbb{R}^1, \qquad E R_t(\varepsilon_t) \le -c, \quad t = 1, 2, \ldots;$$


S-1 
p> Pp EL R,(&;)* < oo. 
A 


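Then for each MLE $\hat\vartheta_n$ we have

$$\hat\vartheta_n \to \vartheta_0 \quad \text{a.s.}$$

Proof. The proof proceeds analogously to the proof of the consistency of the
WLSE. Because of

$$L_n(Y_n, \vartheta) = L_n(Y_n, \vartheta_0) + \sum_{t=1}^{n} a_t\, l_t^{(1)}(\varepsilon_t) + \frac12 \sum_{t=1}^{n} a_t^2\, l_t^{(2)}(\varepsilon_t + \delta_t a_t)$$

with $a_t = g_{\vartheta_0}(x_t) - g_\vartheta(x_t)$ and $\delta_t = \delta_t(\varepsilon_t)$, $|\delta_t| < 1$, we have with

$$A_n(\vartheta) = \frac{1}{n} \sum_{t=1}^{n} a_t\, l_t^{(1)}(\varepsilon_t) \quad \text{and} \quad B_n(\vartheta) = \frac{1}{2n} \sum_{t=1}^{n} a_t^2\, l_t^{(2)}(\varepsilon_t + \delta_t a_t)$$

that

$$\frac{1}{n} \big( L_n(Y_n, \vartheta) - L_n(Y_n, \vartheta_0) \big) = A_n(\vartheta) + B_n(\vartheta).$$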

According to the strong law of large numbers A,(#) ——+ 0 results for each # 
foe} 2 


if the series }) —, converges. But this follows from Abel’s convergence crite- 
rionand ‘1 


1 n 
— sa; =* 


lg — 9o,\n moar *l9e — ¥o,| 
NM t=1 


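With arguments analogous to those used in the proof of Lemma 1.1.2 one shows
that, moreover,

$$\sup_{\vartheta \in \Theta} |A_n(\vartheta)| \to 0 \quad \text{a.s.}$$

In the next step we derive an upper bound for $B_n(\vartheta)$:

$$B_n(\vartheta) = \frac{1}{2n} \sum_{t=1}^{n} a_t^2\, l_t^{(2)}(\varepsilon_t + \delta_t a_t) \le \frac{1}{2n} \sum_{t=1}^{n} a_t^2 R_t(\varepsilon_t) = \frac{1}{2n} \sum_{t=1}^{n} a_t^2 \big( R_t(\varepsilon_t) - E R_t(\varepsilon_t) \big) + \frac{1}{2n} \sum_{t=1}^{n} a_t^2\, E R_t(\varepsilon_t). \qquad (59)$$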
The first summand of the right-hand side of (59) converges uniformly in # 
almost surely towards zero (again as a consequence of Lemma 1.1.2), and the 
second summand is bounded uniformly in # for sufficiently large n by 
1 C 
a S a BR, (&:) = —> [ge 


Cc 
— go|, = —— |9e — 90,” 
2n t=1 i 4 
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Until now we have shown that, for almost all error sequences {¢;}, there exists 
an integer m = no({e;}) so that uniformly in #, 


il 1 
ar (L,(Y a; 8) — Ly(Yn,%)) S ree lgs — 9o,| (60) 


holds for all n = 7. 

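Now let $\hat\vartheta_n$ be an MLE and $\{\varepsilon_t\}$ a sequence with (60). Because of $\hat\vartheta_n(\varepsilon_1, \ldots, \varepsilon_n) \in \Theta$ and because of the compactness of $\Theta$, each limit point of $\{\hat\vartheta_n\}$ lies in $\Theta$.
Let $\vartheta'$ be an arbitrary limit point and $\{\hat\vartheta_{n_k}\}$ a subsequence converging to $\vartheta'$;
then it holds for $n_k \ge n_0$ that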


1 1 
Os om ade ee Bn,) er LAY» 3o)) = ass ¢ \95,, me 9.0 i 


1 Pe 
pea carne go — go,|"- 


From this it follows that |g», — g»,| = 0 and with A, we obtain ® = %. 


Theorem 1.1.7 If in addition to the assumptions of Theorem 1.1.6 the assump- 
tions A, for the class KH, extended by the functions ky, - kp, 1,7 = 1,...,p 
and A, (b) for I!(&,) instead of &, are satisfied, then 


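$$\mathcal{L}\big(\sqrt{n}\,(\hat\vartheta_n - \vartheta) \mid \zeta\big) \to N(0, I^{-1}(\zeta))$$

with $I(\zeta) = \langle k_\vartheta, k_\vartheta \rangle_s$ holds for each MLE $\hat\vartheta_n$, and hence $\hat\vartheta_n$ is BAN according to
Theorem 1.1.4.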


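Proof. Because of $\hat\vartheta_n \to \vartheta_0 \in \Theta^\circ$ a.s. we can assume without loss of generality
that $\hat\vartheta_n$ is a solution of the normal equations

$$\frac{1}{n} \sum_{t=1}^{n} l_t^{(1)}\big(y_t - g_{\hat\vartheta_n}(x_t)\big)\, k_{\hat\vartheta_n}(x_t) = 0.$$

With $D(\vartheta') := \frac{1}{n}\, \partial L_n(Y_n, \vartheta') / \partial \vartheta'$ it follows that

$$0 = D(\hat\vartheta_n) = D(\vartheta_0) + \frac{\partial D(\vartheta')}{\partial \vartheta'}\Big|_{\vartheta' = \vartheta^*} (\hat\vartheta_n - \vartheta_0) \qquad (61)$$

with a suitable intermediate point $\vartheta^*$.
First we show

$$\frac{\partial D(\vartheta')}{\partial \vartheta'}\Big|_{\vartheta' = \vartheta^*} - \frac{\partial D(\vartheta')}{\partial \vartheta'}\Big|_{\vartheta' = \vartheta_0} \to 0 \quad \text{a.s.} \qquad (62)$$

For this it suffices that, for each $i, j = 1, \ldots, p$,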


1 n 
es py [2 (ue in 9 yx(©1)) Ke gx (1) Keon (2) — Ue) ko,(t) ko,()| 

1 By ee ' 
= FE (Pye = goplerd) — WCE) gg re) ge () 

i n 
a a x L?(e:) (,x(21) Ke gx (1) — ko (xt) ko,(24)) (63) 
and — 

1 n 
5 x [2 (ye ae 9 gx(%)) K gx (1) — UME) Ko ,(«)| (64) 


converge to zero almost surely. 
Because of the uniform continuity of J?) and 9* => # there exists for 
almost all realizations {e,} and for each 7 > 0 an ny with 


12 
oa > (1 (yi 3 9 y#(21)) — 1(e1)) kx (2) ke yw (1) 
Sn|k yal, [box | woe? 7 lho! Ihe 


for all nm = np. Thus the first summand of the right-hand side of (63) converges 
to zero almost surely. A repeated application of Lemma 1.1.2 yields 


us n 
Dd P(e) Kg (X) kg,(Xt) 


nN t=1 


cea 


t=1 


a 8 = NS Spkeg (21) ko,(2t) a+ —*(ke,, ks,) 


N t=1 


uniformly in #, from which 


n 
Dy (é:) Te gx (21) Kegx (1) ==+ —A(ky,, ko,) 


Slr 
p 
Ll 


results and hence 


DUPE) (hap) Kye (2) — eo, (4) ho (er)\ “+ 0.- 


t=1 


zlR 
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Now we consider (64). Because of 


1 n 
ms »L [20 (Ye im 9*(*1) K yx (%) — 1)(&) K5,(%))| =A, + B, 


t=1 

with 

Ay = — S (H(ue — ayqled) — 2M) Kye, 
and 

B, = A DE: 1 (:) (K, aX, « (x 7 K4,(%)) 
as well as 

|A,/? Ss * S (Pu aeG, ox (%)) — YP(e ))? |Ky2 «|, 
and 

© (Us = seqlee)) — WC]? > 0 


we get A, ——> 0. 
At the same time, B,, > 0 is true because, on account of Lemma 1.1.2, 


1 n 
7 DX UP (Es) (Ko(%1) — Ko(x:)) 


t=1 


converges uniformly in # a.s., which proves (62). 


ney 1 (&) Ko,(21) + 0 


nN t=1 
and 
1 n 
— DY PEs) ho, (x1) kp,(x;) 
 t=1 
1 n 
=—) (1?( ) l(€)) ko (%4) ke,(x;) 
1 en 
a+ —(ke, ko,) = —I;,(2) 
imply 
OD()') a.8 
; =r (C) 
OF lye 


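and with this, by (62), also

$$\frac{\partial D(\vartheta')}{\partial \vartheta'}\Big|_{\vartheta' = \vartheta^*} \to -I(\zeta) \quad \text{a.s.} \qquad (65)$$

Furthermore, applying the results of Bunke and Bunke (1986, theorem 2.4.3
and lemma 2.4.2) gives

$$\mathcal{L}\big(\sqrt{n}\, D(\vartheta_0) \mid \zeta\big) \to N(0, I(\zeta)),$$

from which we conclude with (65) that

$$\mathcal{L}\big(\sqrt{n}\,(\hat\vartheta_n - \vartheta_0) \mid \zeta\big) \to N(0, I^{-1}(\zeta)).$$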


Remark 1.1.9 Chanda (1976) also gave the lower bound for the limiting
covariance matrix as well as weak consistency and limiting distribution of the
MLE and the GLSE (without proof). His regularity conditions cannot be compared
directly with our assumptions: for example, he does not need the compactness of
the parameter's range, but on the other hand he assumes the existence of the
third derivative of the regression function and the uniform boundedness of $g_\vartheta$
and of all its derivatives.


1.1.7 Asymptotic results for estimators and tests of the variance 


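In this section we consider the special adequate nonlinear model

$$y_t = g_{\vartheta_0}(x_t) + \varepsilon_t, \quad t = 1, 2, \ldots,$$

with

$$g_\vartheta(x) = \alpha' h_\beta(x), \qquad \vartheta = (\alpha, \beta) \in \mathbb{R}^m \times B,$$

and independently identically distributed errors $\varepsilon_t$, $t = 1, 2, \ldots$ In the following
let $A_1$ (a), $A_5$, $A_6$, $A_7$ and $A_8$ be fulfilled.
We are interested in the asymptotic behaviour of the residual estimator

$$\hat\sigma_n^2 = \frac{1}{n} \sum_{t=1}^{n} \big(y_t - g_{\hat\vartheta_n}(x_t)\big)^2 = |y - g_{\hat\vartheta_n}|_n^2.$$
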

In Theorem 1.1.4 strong consistency of 6? was proved by assuming the com- 
pactness of #, A; and A;. The following theorem provides the asymptotic 
normality of 6°. 


Theorem 1.1.8 $\mathcal{L}\{\sqrt{n}\,(\hat\sigma_n^2 - \sigma^2)\} \to N(0, \gamma)$ with $\gamma = D\varepsilon_t^2$.
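
A small Monte Carlo experiment can make the statement of Theorem 1.1.8 tangible. The sketch assumes the simplest possible model, $y_t = \vartheta + \varepsilon_t$ with standard normal errors, where the LSE is the sample mean and $\gamma = D\varepsilon_t^2 = 2$; the sample size, the number of replications and the seed are arbitrary illustrative choices.

```python
import random
import statistics

def residual_variance(y):
    """sigma_hat^2 = (1/n) sum (y_t - theta_hat)^2 in the location model (LSE = mean)."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)

random.seed(0)
n, reps, sigma2 = 200, 2000, 1.0
scaled = []
for _ in range(reps):
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    scaled.append((n ** 0.5) * (residual_variance(y) - sigma2))

# Empirical variance of sqrt(n) (sigma_hat^2 - sigma^2); it should be near gamma = 2.
empirical_gamma = statistics.pvariance(scaled)
print(empirical_gamma)
```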


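Proof. Because of

$$\hat\sigma_n^2 = Q_n(\hat\vartheta_n) = |\varepsilon|_n^2 + |g_{\vartheta_0} - g_{\hat\vartheta_n}|_n^2 + 2 \langle g_{\vartheta_0} - g_{\hat\vartheta_n}, \varepsilon \rangle_n$$

it follows that

$$\sqrt{n}\,(\hat\sigma_n^2 - \sigma^2) = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} (\varepsilon_t^2 - \sigma^2) + \sqrt{n}\, |g_{\vartheta_0} - g_{\hat\vartheta_n}|_n^2 + 2 \sqrt{n}\, \langle g_{\vartheta_0} - g_{\hat\vartheta_n}, \varepsilon \rangle_n, \qquad (66)$$

and thus the assertion is established if

$$\sqrt{n}\, |g_{\vartheta_0} - g_{\hat\vartheta_n}|_n^2 + 2 \sqrt{n}\, \langle g_{\vartheta_0} - g_{\hat\vartheta_n}, \varepsilon \rangle_n$$

converges to zero in probability.
Applying Taylor's theorem to the first of these two summands yields the existence
of a $\vartheta^*$ with $\|\vartheta^* - \vartheta_0\| \le \|\hat\vartheta_n - \vartheta_0\|$ such that

$$\sqrt{n}\, |g_{\vartheta_0} - g_{\hat\vartheta_n}|_n^2 = 2 \sqrt{n}\, (\hat\vartheta_n - \vartheta_0)' \langle k_{\vartheta^*}, g_{\vartheta^*} - g_{\vartheta_0} \rangle_n. \qquad (67)$$

Because of $\vartheta^* \to \vartheta_0$ a.s. and

$$\mathcal{L}\{\sqrt{n}\,(\hat\vartheta_n - \vartheta_0)\} \to N\big(0, \sigma^2 \langle k_{\vartheta_0}, k_{\vartheta_0} \rangle^{-1}\big),$$

(67) implies that

$$\sqrt{n}\, |g_{\vartheta_0} - g_{\hat\vartheta_n}|_n^2 \to 0 \quad \text{in probability}. \qquad (68)$$

Furthermore, for the second summand it again holds, with a suitable $\vartheta^*$, that

$$2 \sqrt{n}\, \langle g_{\vartheta_0} - g_{\hat\vartheta_n}, \varepsilon \rangle_n = 2 \sqrt{n}\, (\vartheta_0 - \hat\vartheta_n)' \langle k_{\vartheta^*}, \varepsilon \rangle_n,$$

and for its convergence in probability to zero,

$$\langle k_{\vartheta^*}, \varepsilon \rangle_n \to 0 \quad \text{in probability} \qquad (69)$$

is sufficient. (69) results from Lemma 1.1.2 and $\vartheta^* \to \vartheta_0$ a.s.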


Next we prove the asymptotic efficiency of $\hat\sigma_n^2$. For this purpose let $P_{n\zeta}$
denote the distribution of

$$Y_n = (y_1, \ldots, y_n)' \quad \text{under} \quad \zeta = (\vartheta, \sigma^2, \kappa) \in \mathbb{R}^m \times B \times \mathbb{R}^+ \times \mathcal{K},$$

and we show that the sequence $\{P_{n\zeta_n}\}$ with $\zeta_n = (\vartheta, \sigma^2 + n^{-1/2} h, \kappa)$ is
contiguous (cf. [A 2.2]) to the sequence $\{P_{n\zeta}\}$ for any positive $h$ if $\kappa$ fulfils some
regularity conditions. For this we assume that $\kappa$ is absolutely continuous with
respect to a $\sigma$-finite measure $\mu$, and by $L(u) = \log \frac{d\kappa}{d\mu}(u)$ we denote the
logarithm of a positive variant of the $\mu$-density of $U_t := \varepsilon_t / \sigma$.


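Lemma 1.1.7 Under the assumptions:

1. $L$ is twice continuously differentiable w.r.t. $u$,

$$L^{(1)}(u) = \frac{d}{du} L(u), \qquad L^{(2)}(u) = \frac{d^2}{du^2} L(u); \qquad (70)$$

2. $E\{U L^{(1)}(U)\} = -1$, $\quad D\{U L^{(1)}(U)\} = 1 - E\{U^2 L^{(2)}(U)\} > 0$; $\qquad (71)$

3. There are functions $M$ and $J$: $\mathbb{R}^1 \to \mathbb{R}^1$ and a positive number $\delta^* > 0$ such
that, for any $\delta$ with $|\delta| < \delta^*$,

$$|L^{(2)}(u(1 + \delta)) - L^{(2)}(u)| \le J(\delta)\, M(u) \qquad (72)$$

is satisfied with $E\{M(U)\, U^2\} < \infty$ and $J(\delta) \to 0$ as $\delta \to 0$,

it results that

$$\mathcal{L}\big(L_n(Y_n, \zeta_n) - L_n(Y_n, \zeta) \mid \zeta\big) \to N\big(-\tfrac12 h^2 I(\zeta),\; h^2 I(\zeta)\big) \qquad (73)$$

with $I(\zeta) = \frac{1}{4\sigma^4}\, D\{U L^{(1)}(U)\}$, where $L_n(Y_n, \zeta)$ denotes the logarithm of the
density of $Y_n$ with respect to $\mu$.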


Remark 1.1.10 With [A 2.3], (73) implies the contiguity of the sequence 

{Pne,} to the sequence {P,,.}. 
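
As a sanity check of the assumptions of Lemma 1.1.7, the standard normal case can be worked out by hand. The computation below assumes that the constant in $I(\zeta)$ is $1/(4\sigma^4)$, which is the reading under which the result agrees with the classical Fisher information for $\sigma^2$.

```latex
\begin{align*}
L(u) &= -\tfrac12 u^2 - \tfrac12 \log(2\pi), \qquad
L^{(1)}(u) = -u, \qquad L^{(2)}(u) = -1, \\
E\{U L^{(1)}(U)\} &= -E\,U^2 = -1, \\
D\{U L^{(1)}(U)\} &= D\{U^2\} = E\,U^4 - (E\,U^2)^2 = 3 - 1 = 2
  = 1 - E\{U^2 L^{(2)}(U)\}, \\
|L^{(2)}(u(1+\delta)) - L^{(2)}(u)| &= 0, \\
I(\zeta) &= \frac{1}{4\sigma^4}\, D\{U L^{(1)}(U)\} = \frac{1}{2\sigma^4}.
\end{align*}
```

So (70) to (72) hold trivially for normal errors, and $I(\zeta)^{-1} = 2\sigma^4$ coincides with $\gamma = D\varepsilon_t^2$ in the normal case.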

Remark 1.1.11 The assumptions 2 and 3 are satisfied if $\mu$ is the Lebesgue
measure, $p(u) = \frac{d\kappa}{d\mu}(u)$ is positive everywhere and if $u\, p(u) \to 0$ as well
as $u^2 p^{(1)}(u) \to 0$ hold for $u \to \pm\infty$. This can easily be proved by partial integration.

Proof of Lemma 1.1.7 We put t = o? and t, = o7. Because of 


Beast Be) | 


t=1 


=—Snlnr+ Ss L(U;) 


t—1) 
it follows that 


. 1 Tt 
L,(Y 5; En) = L(V» a) eS Da [L(U; oe V;) zz: L(U;)] + oy n In Ee 
t=1 n 
(74) 
with 
qil2 — 71/2 
a ae 72 U; 0 
A Taylor expansion around U; up to the second term yields 
UVa) a Age 
t=1 
s Dv L©(U,) + Le + WALO(U, + AV,) =: Aa + B (75) 


where 4, is a random variable withO <4, 31. 
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zs qu2 _ me 
Witha, = ta ee 
1 
gt Teh Ye ig et EO ECT ie ee ee 
2 Tn 7 t=1 2 Tn 
— 1h 1 eases 
and because of Yn a, ——>+ ——— and —nIn eee NA, aS 
i n—->co 9 T 9 ie n 8 72 
obtain 
ea i US 2 
oie tea ee CN (eee D{UL®(U)} (76) 
2 Ga S707 peer 
Now we show 
Bele : ue E(U2L®(U). (77) 
7 
i 1 1 Se eae 
Due to B, = — na® nee U?L®(U, + 4V;) and — naz, — > — — it is 
2 N t=1 2, 8 7? 
sufficient to show that 
+ S UPL U, + 4V;) 2 EPL) (78) 
i 
(78) is equivalent to 
ae *: 
— ©) U{LO(U, + AV,) — L(U,)) + 0 (79) 


NM t=1 


Because of (72), 4, < 1 and a, — => 0 it holds, for sufficiently large n, that 


oe 


 t=1 


1 n 
< max J(i,a,) — Py U?7M(U;) 
M t=1 


1Stsn 


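Now we consider arbitrary but fixed parameters $(\vartheta, \kappa) \in \mathbb{R}^m \times B \times \mathcal{K}_0$,
where $\mathcal{K}_0$ denotes the subset of $\mathcal{K}$ with (70)-(72). Under this assumption
$Y_n$ has a distribution in the one-parameter distribution family
$\{P_{n\sigma^2} \mid \sigma^2 \in \mathbb{R}^+\}$ with $P_{n\sigma^2} := P_{n(\vartheta, \sigma^2, \kappa)}$. On account of [A 2.8] and (73) it thus
follows for any estimator sequence $\{\tilde\sigma_n^2\}$ with $\tilde\sigma_n^2 = \tilde\sigma_n^2(Y_n)$ and $\mathcal{L}\big(\sqrt{n}\,(\tilde\sigma_n^2 - \sigma^2) \mid \sigma^2\big) \to N(0, S(\sigma^2))$ for all $\sigma^2 \in \mathbb{R}^+$ and fixed $\vartheta$ and $\kappa$ that

$$S(\sigma^2) \ge I(\zeta)^{-1} \quad \text{for almost all } \sigma^2.$$

As w.r.t. the family $\{P_{n\zeta}\}$ the class of all asymptotically normally distributed
estimators $\tilde\sigma_n^2$ is contained in the class of all asymptotically normally
distributed estimators w.r.t. $\{P_{n\sigma^2} \mid \sigma^2 \in \mathbb{R}^+\}$ for any fixed $\vartheta$ and $\kappa$, it holds
for any estimation sequence $\{\tilde\sigma_n^2\}$ with

$$\mathcal{L}\big(\sqrt{n}\,(\tilde\sigma_n^2 - \sigma^2) \mid \zeta\big) \to N(0, S(\zeta))$$

that

$$S(\zeta) \ge I(\zeta)^{-1} \quad \text{for almost all } \sigma^2 \text{ and for all } (\vartheta, \kappa) \in \Theta \times \mathcal{K}_0. \qquad (80)$$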

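Now we are able to prove the asymptotic efficiency of $\hat\sigma_n^2$. Let $\mathcal{K}^*$ be the family
of distributions with the densities

$$p_m(u) = c(m)\, u^{2m} \exp\{-(m + \tfrac12)\, u^2\} \qquad (81)$$

for arbitrary integers $m \ge 0$ with

$$c(m) = \begin{cases} \dfrac{1}{\sqrt{2\pi}} & \text{for } m = 0, \\[2ex] \dfrac{(2m + 1)^{m + 1/2}}{1 \cdot 3 \cdots (2m - 1)\, \sqrt{2\pi}} & \text{for } m \ge 1. \end{cases}$$

Simple computations yield $EU = 0$, $EU^2 = 1$, $E\{U L^{(1)}(U)\} = -1$ and

$$D\{U L^{(1)}(U)\} = 4m + 2 = 1 - E\{U^2 L^{(2)}(U)\}.$$

Furthermore,

$$|L^{(2)}(u(1 + \delta)) - L^{(2)}(u)| = \frac{2m}{u^2} \Big| 1 - \frac{1}{(1 + \delta)^2} \Big| =: J(\delta)\, M(u) \quad \text{with } M(u) = u^{-2}$$

holds and thus $J(\delta) \to 0$ for $\delta \to 0$ and $E\{U^2 M(U)\} = 1 < \infty$. Thus $\mathcal{K}^*$ is a subset
of $\mathcal{K}_0$. Additionally, for $\kappa \in \mathcal{K}^*$,

$$\gamma = D\varepsilon_t^2 = \sigma^4\, D U^2 = \frac{2\sigma^4}{2m + 1} = \frac{4\sigma^4}{4m + 2} = I(\zeta)^{-1}$$

holds and hence the lower bound (80) is attained for $\hat\sigma_n^2$.
Thus we have proved the following theorem:

Theorem 1.1.9 The estimator $\hat\sigma_n^2$ is BAN for all $\kappa \in \mathcal{K}^*$.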
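
The moment identities used for the family (81) can be verified numerically. The sketch below integrates the densities $p_m(u) = c(m)\, u^{2m} \exp\{-(m + \frac12) u^2\}$ with a composite Simpson rule; the helper names (`c`, `p`, `integrate`, `moments`), the integration range and the step count are illustrative assumptions, and the closed form $U L^{(1)}(U) = 2m - (2m+1) U^2$ follows from differentiating $\log p_m$.

```python
import math

def c(m):
    """Normalising constant c(m) of p_m(u) = c(m) u^(2m) exp(-(m + 1/2) u^2)."""
    if m == 0:
        return 1.0 / math.sqrt(2.0 * math.pi)
    odd_product = math.prod(range(1, 2 * m, 2))          # 1 * 3 * ... * (2m - 1)
    return (2 * m + 1) ** (m + 0.5) / (odd_product * math.sqrt(2.0 * math.pi))

def p(u, m):
    return c(m) * u ** (2 * m) * math.exp(-(m + 0.5) * u * u)

def integrate(f, a=-12.0, b=12.0, steps=20000):
    """Composite Simpson rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def moments(m):
    """Return (total mass, E U^2, E{U L'(U)}, D{U L'(U)}, D{U^2}) under p_m."""
    mass = integrate(lambda u: p(u, m))
    second = integrate(lambda u: u * u * p(u, m))
    score = lambda u: 2 * m - (2 * m + 1) * u * u        # U * L'(U) for L = log p_m
    mean_score = integrate(lambda u: score(u) * p(u, m))
    var_score = integrate(lambda u: (score(u) - mean_score) ** 2 * p(u, m))
    fourth = integrate(lambda u: u ** 4 * p(u, m))
    return mass, second, mean_score, var_score, fourth - second ** 2

results = {m: moments(m) for m in range(4)}
for m, values in results.items():
    print(m, [round(v, 6) for v in values])
```

For each $m$ the last two entries should multiply to 4, which is the numerical counterpart of $\gamma = I(\zeta)^{-1}$ with $\sigma = 1$.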


Remark 1.1.12 Under normality ($\kappa = N(0, 1)$, $m = 0$), $\hat\sigma_n^2$ is the MLE for $\sigma^2$.
But for all other $\kappa \in \mathcal{K}^*$ the exact computation of the MLE is complicated
even in the simplest case $y_t = \vartheta + \varepsilon_t$, $t = 1, 2, \ldots, n$, because of the
nonlinearity of the likelihood equations


2 | ah eee 
AC as Se = aude >y (ye am go(2,))? 2() 
2 Paes A 

1 
eneee 
mu rue AE _ (y = go(x,)) keg (a) == ((), 
t=1 1 (Yi ais (2) oF t=1 


Remark 1.1.13 The equality y = J(¢)-! can be obtained iff €, has a density 
of the form 


Pole) = (0?)~* Ale, c) exp ea 


with a positive constant c and a function h not depending on o?. 
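Theorem 1.1.8 admits the construction of asymptotic $\alpha$-tests (cf. [A 2.16])
for the test problem

$$H: \sigma^2 \le \sigma_0^2 \quad \text{against} \quad K: \sigma^2 > \sigma_0^2.$$

If $\kappa$ is quasinormal ($\gamma = 2\sigma^4$), then the test

$$\varphi_n(Y_n) = \begin{cases} 1 & \text{if } Z_n \ge u_{1-\alpha}, \\ 0 & \text{otherwise} \end{cases} \qquad (82)$$

with $Z_n = \dfrac{\sqrt{n}\,(\hat\sigma_n^2 - \sigma_0^2)}{\sqrt{2}\, \sigma_0^2}$ and $u_\alpha$ the $\alpha$-fractile of $N(0, 1)$ is an asymptotic
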

«-test since under the hypothesis Hy of Theorem 1.1.8 it follows that 


lim sup Bep,(¥) 
yn 62 yn 2 
= Sn yeup Pe, — o*) + —— (0? — 09) = Us 
y2 ze 2 0% 


cine yo, ae ei 


n—>co V2 2 
2 


LS wae. rye pone 
Furthermore, V,, = = ae eats te 2 = 
V2 9% y2 % 
A(c) > 0 for o% > 63, there is a 6 = d(a) > 0 and an m = no(d) such that 


= :i(c) and because of 


S A(c) — 6 is satisfied for all n => ng. For n = ny this implies 


Ley (Yn) = Py Mee Ae 2 PrtV, S Xo) — 9} 


n 


= PaullVn — Ao) SO} 5 


and thus the test (82) is also consistent for fixed alternatives.

In the next theorem we will derive the limiting distribution of the test
statistic $Z_n$ under local alternatives

$$K_n: \sigma_n^2 = \sigma_0^2 + n^{-1/2} h \qquad (83)$$

with $h > 0$.

Theorem 1.1.10 For all $\zeta_n = (\vartheta, \sigma_n^2, \kappa)$ with $\sigma_n^2 = \sigma_0^2 + n^{-1/2} h$ and $\kappa \in \mathcal{K}_0$
it holds that

$$\mathcal{L}(Z_n \mid \zeta_n) \to N\Big( \frac{h}{\sqrt{2}\, \sigma_0^2},\, 1 \Big).$$


Proof. According to Lemma 1.1.7 the sequence {P,;} is contiguous to the 


Prop Pere 
sequence {P,,} with Cy = (8, 05, x). Hence nj, acl 0 results from 4, LL 
and thus 


F 1 
£(/n (62 — 0) |6,) =£ {= Eee — 63) + mn | ) > N(0, 204) 
nm t=1 
With (84) and (84) 
3 \/n (62, — 0?) & h 


72 0 /2 0 


the assertion is proved.

If $\kappa$ is not quasinormal, the test (82) cannot be applied. In this case a
consistent estimator $\hat\gamma_n$ for $\gamma$ is needed.

Lemma 1.1.8 The estimator

$$\hat\gamma_n = \frac{1}{n} \sum_{t=1}^{n} \Big( \big(y_t - g_{\hat\vartheta_n}(x_t)\big)^2 - \hat\sigma_n^2 \Big)^2$$

is weakly consistent for $\gamma$ in case that $\kappa$ has a finite moment of 6th order.


Proof. Because of 
‘ by 4 g (z;) 4 _ 6 
“ (ys 95, t ) n 


and 64 + o4, it is sufficient to show that 


3 
3 


1 
+ 6 es ya (go(a1) eetle (21)? g—4—) (go(2) a 94 (71)) & 


N t=1 z N t=1 
+= ¥ (aslo) — 94,(00) (85) 
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converges to #,&} in P, probability. As this is the case for the first summand 
of the right-hand side of (85), we show that all the other summands converge 
to zero in P,; probability. 


n 


Because of a 2S (go(«) a 9, (a))* ae (Vx |go 


nN t=1 


yt ? and (68), 


~ 3 (gale) — 95,(a0))* 7% 0 (86) 
: 


holds. 


follows from 


2 (go(%:) — 9g (2) | S |g 


Next, (86) and 


1 2 1/2 n 1/2 
- — & (oat Xt) = 9s AC) es a » (go(@) = a5,00)') 2 DS e) 


n t1 t=1 
imply 

| a n 

= (Gola) — 9 (x)? €? —*+ 0 (87) 

2 t=1 “4 
and finally 

1 PS OSS 

Be (go(:) — 9% (x) &p > 0 

NM t=1 


results from (86), (87), and 


fe Y (gale) — 95, (a0)° 


t=1 


<a mh a bi »a\l? 
= (SE ned (g0( x) — 9g, (2t)) ] e ne (g0(%) = 99, (%1)) et) Ez 


2 t=1 
Lemma 1.1.8 and Theorem 1.1.8 imply that the statistic

$$T_n = \frac{\sqrt{n}\,(\hat\sigma_n^2 - \sigma_0^2)}{\sqrt{\hat\gamma_n}}$$

is asymptotically standard normally distributed under $\sigma_0^2$. Hence the test

$$\psi_n(Y_n) = \begin{cases} 1 & \text{if } T_n \ge u_{1-\alpha}, \\ 0 & \text{otherwise} \end{cases} \qquad (88)$$

is an asymptotic $\alpha$-test and it is a consistent one if $\kappa$ has a finite moment of 6th
order. With respect to the limit distribution of the test statistics under local
alternatives (83), the following theorem is valid.

Theorem 1.1.11 $\mathcal{L}(T_n \mid \zeta_n) \to N\Big( \dfrac{h}{\sqrt{\gamma}},\, 1 \Big)$ for $\kappa \in \mathcal{K}_0$.


Proof. Because of T,, = Vy/¥, Z, and Theorem 1.1.10, it is sufficient to show 
that 


eo a at 
yy (89) 


(89) results from $\hat\gamma_n \to \gamma$ in probability and from the contiguity of distribution sequences.

Moreover, the asymptotic local efficiency of the tests (82) and (88) is proved
in Schmidt (1979).

All results of this section are valid for the special case of the linear model
$g_\vartheta(x) = \alpha' x$ without the assumptions $A_5$, $A_6$, $A_7$ and $A_8$ (cf. Schmidt, 1979).
Besides this, the rate of convergence of the distributions of $\hat\sigma_n^2$ and of the test
statistics to the corresponding limiting distributions has been derived in that
case.
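
The two variance tests of this section can be sketched in a few lines once the residuals of the least-squares fit are available. The function below is a minimal illustration under stated assumptions: it takes the residuals $y_t - g_{\hat\vartheta_n}(x_t)$ as given, uses the one-sided rejection rule "statistic at least the upper $\alpha$-point of $N(0,1)$", and the name `variance_tests` is hypothetical.

```python
import math
from statistics import NormalDist

def variance_tests(residuals, sigma0_sq, alpha=0.05):
    """One-sided asymptotic tests of H: sigma^2 <= sigma0^2 from LSE residuals.

    Returns (Z_n, T_n, reject_Z, reject_T). Z_n assumes gamma = 2 sigma^4
    (quasinormal errors); T_n plugs in the moment estimator gamma_hat and
    requires gamma_hat > 0.
    """
    n = len(residuals)
    squares = [r * r for r in residuals]
    sigma_hat_sq = sum(squares) / n
    gamma_hat = sum((s - sigma_hat_sq) ** 2 for s in squares) / n
    z_n = math.sqrt(n) * (sigma_hat_sq - sigma0_sq) / (math.sqrt(2.0) * sigma0_sq)
    t_n = math.sqrt(n) * (sigma_hat_sq - sigma0_sq) / math.sqrt(gamma_hat)
    u = NormalDist().inv_cdf(1.0 - alpha)                # upper alpha-point of N(0, 1)
    return z_n, t_n, z_n >= u, t_n >= u

# Tiny deterministic example: squared residuals 4, 0, 4, 0.
z_n, t_n, reject_z, reject_t = variance_tests([2.0, 0.0, -2.0, 0.0], sigma0_sq=1.0)
print(z_n, t_n, reject_z, reject_t)
```

The only difference between the two statistics is the normalisation: $\sqrt{2}\,\sigma_0^2$ is the standard deviation implied by quasinormality, while $\sqrt{\hat\gamma_n}$ estimates it from the data.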


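1.1.8 Tests and confidence regions for regression coefficients

In this section we restrict ourselves to the adequate model

$$y_t = g_{\vartheta_0}(x_t) + \varepsilon_t, \quad t = 1, 2, \ldots,$$

with identically distributed errors $\varepsilon_t$,

$$E\varepsilon_t = 0, \quad E\varepsilon_t^2 = \sigma^2, \quad \vartheta \in \Theta \subset \mathbb{R}^k,$$

and with

$$\vartheta = (\vartheta^{(1)\prime}, \vartheta^{(2)\prime})', \quad \vartheta^{(1)} \in \mathbb{R}^{k-s},$$

we consider the test problem

$$H: \vartheta^{(1)} = \vartheta_0^{(1)} \quad \text{against} \quad K: \vartheta^{(1)} \ne \vartheta_0^{(1)}, \qquad (90)$$

that is, the question whether too many parameters have been incorporated
into the model.

For this an asymptotic $\alpha$-test can be constructed using the limiting
distribution of the LSE as follows. If $\hat\vartheta_n = (\hat\vartheta_n^{(1)\prime}, \hat\vartheta_n^{(2)\prime})'$ is the LSE for $\vartheta$, then we
would reject $H$ if a suitable normalization of the distance from $\hat\vartheta_n^{(1)}$ to $\vartheta_0^{(1)}$ is
too large. According to Theorem 1.1.2, because of $S_n^2 := \hat\sigma_n^2 \to \sigma^2$ and of the
continuity of $B^{-1}(\cdot)$,

$$Z_n = n\, S_n^{-2}\, (\hat\vartheta_n^{(1)} - \vartheta_0^{(1)})'\, B_{11}^{-1}(\hat\vartheta_n)\, (\hat\vartheta_n^{(1)} - \vartheta_0^{(1)})$$

is asymptotically central $\chi^2$-distributed with $k - s$ degrees of freedom under the
hypothesis. That means the test

$$\varphi_n(Y_n) = \begin{cases} 1 & \text{if } Z_n \ge \chi^2_{k-s;\,1-\alpha}, \\ 0 & \text{otherwise} \end{cases} \qquad (91)$$
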
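is an asymptotic $\alpha$-test for the test problem (90). Moreover, the following
theorem yields the asymptotic power of this test for local alternatives

$$K_n: \vartheta_n = \vartheta_0 + \frac{1}{\sqrt{n}}\, h \quad \text{with} \quad \vartheta_0 = (\vartheta_0^{(1)\prime}, \vartheta^{(2)\prime})'.$$

Theorem 1.1.12 Let $h = (h^{(1)\prime}, h^{(2)\prime})' \in \mathbb{R}^k$ with $h^{(1)} \in \mathbb{R}^{k-s}$ and $\delta^2 := \frac{1}{\sigma^2}\, h^{(1)\prime} B_{11}^{-1}(\vartheta_0)\, h^{(1)}$. Under $K_n$ the test statistic $Z_n$ has asymptotically a noncentral
$\chi^2$-distribution with $k - s$ degrees of freedom and the noncentrality parameter
$\delta^2$ if the assumptions of Theorem 1.1.2, part 1, and Lemma 1.1.6 are satisfied.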

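Proof. Let $\zeta_n = (\vartheta_n, \sigma^2, \kappa)$ and $\zeta_0 = (\vartheta_0, \sigma^2, \kappa)$. Because of the contiguity,
$\hat\vartheta_n \to \vartheta_0$ and $S_n^2 \to \sigma^2$ in $P_{n\zeta_0}$ probability imply

$$\hat\vartheta_n \to \vartheta_0 \quad \text{and} \quad S_n^2 \to \sigma^2 \quad \text{in } P_{n\zeta_n} \text{ probability}.$$

In particular, $B_{11}^{-1}(\hat\vartheta_n) \to B_{11}^{-1}(\vartheta_0)$ in $P_{n\zeta_n}$ probability results from this because of the
continuity of $B(\vartheta)$.
Analogously to the proof of Theorem 1.1.2, by Taylor expansion and using
the contiguity one shows that

$$\sqrt{n}\,(\hat\vartheta_n - \vartheta_n) = B(\vartheta_0)\, \sqrt{n}\, \langle k_{\vartheta_0}, \varepsilon \rangle_n + o_{P_{n\zeta_n}}(1),$$

which implies

$$\mathcal{L}\big(\sqrt{n}\,(\hat\vartheta_n - \vartheta_0) \mid \zeta_n\big) \to N(h, \sigma^2 B(\vartheta_0)),$$

and thus

$$\mathcal{L}\big(B_{11}^{-1/2}(\hat\vartheta_n)\, \sqrt{n}\,(\hat\vartheta_n^{(1)} - \vartheta_0^{(1)}) \mid \zeta_n\big) \to N\big(B_{11}^{-1/2}(\vartheta_0)\, h^{(1)},\; \sigma^2 I_{k-s}\big).$$

From this, because of $S_n^2 \to \sigma^2$ in $P_{n\zeta_n}$ probability, the assertion follows.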
Under fixed alternatives K: 9M + 94) 


Z, = nSz? |} — gaye 


BS.) 
Len = Py, 
= o* {fn (SY — 0) + Yn (0 — 62) 1,5 + o7,g(1) 2+ 00 
implies the consistency of the test (91). 
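
To make the construction of the test (91) concrete, the following sketch evaluates the statistic $Z_n$ for the exponential model $g_\vartheta(x) = \alpha e^{\beta x}$ when only $\beta$ is tested, so that $k - s = 1$. It is an illustration under explicit assumptions: the least-squares estimates and $S_n^2$ are taken as given inputs, $B(\vartheta)$ is replaced by the inverse of the finite-$n$ matrix $\frac1n \sum k_\vartheta(x_t) k_\vartheta(x_t)'$, and the names `wald_statistic` and `chi2_1_quantile` as well as the numbers in the example are hypothetical.

```python
import math
from statistics import NormalDist

def wald_statistic(x, alpha_hat, beta_hat, s_sq, beta0):
    """Z_n for H: beta = beta0 in g(x) = alpha * exp(beta * x), tested parameter scalar."""
    n = len(x)
    # Entries of G = (1/n) sum k k' with gradient k = (dg/dbeta, dg/dalpha).
    g_bb = g_ba = g_aa = 0.0
    for xt in x:
        d_alpha = math.exp(beta_hat * xt)
        d_beta = alpha_hat * xt * d_alpha
        g_bb += d_beta * d_beta / n
        g_ba += d_beta * d_alpha / n
        g_aa += d_alpha * d_alpha / n
    det = g_bb * g_aa - g_ba * g_ba
    b11 = g_aa / det                      # (1,1) entry of B = G^{-1}
    return n * (beta_hat - beta0) ** 2 / (s_sq * b11)

def chi2_1_quantile(level):
    """Quantile of the chi-square distribution with one degree of freedom."""
    return NormalDist().inv_cdf(0.5 + level / 2.0) ** 2

x = [0.0, 0.5, 1.0, 1.5, 2.0]
z_n = wald_statistic(x, alpha_hat=2.0, beta_hat=0.3, s_sq=0.04, beta0=0.0)
reject = z_n >= chi2_1_quantile(0.95)
print(z_n, reject)
```

With more than one tested parameter the scalar $B_{11}$ becomes a block of the inverse matrix and the critical value a general $\chi^2_{k-s}$ quantile, which needs a numerical library.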


Under normal distribution and under the hypothesis one can prove (cf.
Gallant, 1975) that

$$\frac{1}{k - s}\, Z_n = V_n + \rho_n \quad \text{with} \quad \rho_n \to 0 \text{ in } P_{n\zeta_0} \text{ probability},$$

and $V_n$ has a central $F$-distribution with $k - s$ and $n - k$ degrees of freedom
for any finite sample size. This could also give rise to the use of $(k - s)\, F_{k-s,\,n-k;\,1-\alpha}$
instead of $\chi^2_{k-s;\,1-\alpha}$. The resulting test is the usual $F$-test in the case of a linear
regression function.

A further principle often used in statistics is the concept of likelihood ratio 
test (LRT) introduced by Wald. In the following we consider the LRT for 
the test problem (90) under normality and discuss some asymptotic properties. 
In the adequate model 


YY: =gela) +&, $€=—1,2,... 
with ¢, ~ N(0, o?), for € = (8, o”), let 
Pre aay N(9$ (a), o2I;) 


be the distribution of Y, = (y;,...,yY,). With 2g := {€ | ¢ = (9,07) € 5}, 
the likelihood principle consists in rejecting the hypothesis H if the value of 
the loglikelihood ratio 


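$$\lambda_n = L_n(Y_n, \hat\zeta_{Hn}) - L_n(Y_n, \hat\zeta_n) = \frac{n}{2} \log \frac{\hat\sigma_n^2}{\hat\sigma_{Hn}^2}$$

is too small. Here, as before, $L_n(Y_n, \zeta)$ is the logarithm of the density of $Y_n$
under $\zeta$, $\hat\zeta_n = (\hat\vartheta_n', \hat\sigma_n^2)'$, and $\hat\zeta_{Hn}$ is the MLE in the model restricted by $H$. Hence
it holds that

$$\hat\zeta_{Hn} = (\hat\vartheta_{Hn}', \hat\sigma_{Hn}^2)'$$

with

$$\hat\vartheta_{Hn} = (\vartheta_0^{(1)\prime}, \hat\vartheta_{Hn}^{(2)\prime})', \qquad \hat\sigma_{Hn}^2 = Q_n(\hat\vartheta_{Hn}),$$

and $\hat\vartheta_{Hn}^{(2)}$ minimizes the least-squares criterion $Q_n(\vartheta_0^{(1)}, \vartheta^{(2)})$ over all possible
values $\vartheta^{(2)}$.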

To fix the $\alpha$-significance point we derive the limiting distribution of the
test statistic $-2\lambda_n$ under the hypothesis as well as under local alternatives

$$K_n: \vartheta_n = \vartheta_0 + \frac{1}{\sqrt{n}}\, h.$$

Theorem 1.1.13 Under the assumptions of Theorem 1.1.2, part 1, and Lemma
1.1.6:

$$\mathcal{L}(-2\lambda_n \mid \zeta_n) \to \chi^2_{k-s}(\delta^2)$$

with $\zeta_n = (\vartheta_n, \sigma^2)$ and $\delta^2 = \frac{1}{\sigma^2}\, h^{(1)\prime} B_{11}^{-1}(\vartheta_0)\, h^{(1)}$.

(Hence the LRT is asymptotically equivalent to the test based on the limit
distribution of the LSE.)


Proof. Let 


0 a , , 
G= (;*-) € Messy xte-+19> h = (h’, 0)’ € Rt 


8+1 
and 

1 \ 
— BS) | 0 

o? ( 
(oy Pee = ea 
0 feels 
peers 


Under H, Y, has the limiting information matrix I(f,) := G'T (Co) G and as 
in the proofs of the Theorems 1.1.2 and 1.1.8 one shows that 


_ (62 — 9@ 4 Yn (ky, &)n . 
Vn ( meh G) Gal a ee ia ta Op(L). (92) 


63, — o? 
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Because of 
LAY ;, é) ar DAY Cx) ae (Ga ee y CE ns 6) LY n, oUF ns ©) 
Oo leat 
bee pe OIAY a0) = | 
+ = Vm (§ — x) ar a ena (¢ — ox) 


with a suitable (ae 


GRAY OT aa 
Secrest. 


and with (92) we obtain, with T,, := Yn (— ¢)) and t = : (9, 0?) that 
—22 = Yn (§ — $n)’ Teo) Vm (6 — $x) + op, (1) 


= [T, — GI-4(60) G'I(£o) Tal’ L(S0o) (Ln — GL-4(E0) @'1(60) Tr] 
+ op, (1). 3 (93) 
Finally, the assertion follows from 


B(S) Vn (ko,> €)n 


7 x & — 0°) 
n 


and 7 (7, |'C,) > (h, Y he 1(£o)) as well as from the contiguity 


es = Yn a op, (1), Yn = 


L(V, | Cn) > N(h,I-(O)) 
and thus from (93). 
Remark 1.1.14 If the errors are not normally distributed, $\mathcal{L}(\sqrt{n}\,(\hat\zeta - \zeta) \mid \zeta) \to N(0, I^{-1}(\zeta))$ with


can be proved under suitable regularity assumptions as in Theorem 1.1.7 if
$\hat\zeta$ is the MLE for $\zeta = (\vartheta', \sigma^2)'$.


Thus in the general case, too, the y?-distribution is the limiting distribution 
of the likelihood ratio test statistic. This means that 


tit oA Sie, 


0 otherwise 
can be applied as an asymptotic $\alpha$-test. Its asymptotic power for local alternatives
can be computed according to (91). In particular, the consistency of
the LRT follows under fixed alternative parameters, since according to (93) 
and with 


A= [I — I(f) GI-*(2) 4] 1(C) UL — (0) GIN) GY 
—22 = lyn (& — 6) + Yn (6 — bo) 4 + 07, (1) 22> 0 
holds. 
The LRT under normal distribution has also been investigated by Gallant
(1975). For finite n he proves that


tigi 1 ead 


with nv Sant 0 and computes the distribution of V. With that distribution the 
x-fractile of the distribution of 67,/6? is approximated, thus defining an ap- 
proximate LRT. 

The question as to which of the two stated tests should be used cannot
always be answered unambiguously. Monte Carlo studies (cf. Gallant, 1975) suggest
that the approximate LRT is preferable to the approximate test based on the
normal distribution of the LSE if only nonlinear parameters are present. If
linear parameters enter the regression function, then both tests have approximately
the same power and for computational reasons one would prefer
the test which is based on the approximation of the distribution of the LSE.

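Now we state some confidence regions for $\vartheta$. Because of the well-known correspondence
between $\alpha$-tests and $(1 - \alpha)$ confidence regions,

$$\{\vartheta \mid (\hat\vartheta_n - \vartheta)' B_n(\hat\vartheta_n) (\hat\vartheta_n - \vartheta) \le s_n^2 \chi^2_{k;\alpha}\} \tag{94}$$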


is obviously a confidence region to the asymptotic level $1 - \alpha$. Besides asymptotic
considerations, approximations for finite sample sizes can be utilized here, too.
They are based on linearizations. Let $\varepsilon_t \sim N(0, \sigma^2)$. If $g_\vartheta$ were
linear in $\vartheta$, then

$$\frac{(n - k)\,\bigl(Q_n(\vartheta) - Q_n(\hat\vartheta_n)\bigr)}{k\,Q_n(\hat\vartheta_n)}$$


would correspond to the loglikelihood ratio and would possess an F-distribution.
For the general nonlinear case the F-distribution could be used as a
first approximation and one would obtain the approximate $(1 - \alpha)$ confidence
region

$$\{\vartheta \mid Q_n(\vartheta) - Q_n(\hat\vartheta_n) \le k n^{-1} s_n^2 F_{k,n-k;\alpha}\}$$

proposed by Beale (1960) (up to certain correction terms).
The form of the region (94) may be too complicated. If a Taylor expansion
up to second order of the left-hand side is applied, then one gets an
approximate $(1 - \alpha)$ confidence region which was proposed by Box and
Coutie (1956):

$$\{\vartheta \mid (\hat\vartheta_n - \vartheta)' Q_n''(\hat\vartheta_n) (\hat\vartheta_n - \vartheta) \le 2 k n^{-1} s_n^2 F_{k,n-k;\alpha}\}$$

with

$$Q_n''(\vartheta) = \left(\frac{\partial^2}{\partial\vartheta_i\,\partial\vartheta_j} Q_n(\vartheta)\right)_{i,j = 1, \ldots, k}.$$

Linearization of the function $g_\vartheta$ by a Taylor expansion of first order implies
the approximate $(1 - \alpha)$ confidence region

$$\{\vartheta \mid (\hat\vartheta_n - \vartheta)' B_n(\hat\vartheta_n) (\hat\vartheta_n - \vartheta) \le k s_n^2 F_{k,n-k;\alpha}\}$$

proposed by Goldfeld and Quandt (1972).


1.2: Switching regression models 


1.2.1 Introduction 


Most of the papers on regression analysis start from observation models

$$y_t = f(x_t) + \varepsilon_t, \qquad t = 1, \ldots, n,$$

with a regression function $f(x) = h(x, \alpha)$ with a constant parameter $\alpha$ for all
points $x$ from a given region $\mathcal{X}$. Often, however, a cause-and-effect relationship that
is to be described by the regression function $f(x) = h(x, \alpha)$ is not stable but
subject to changes.


Example 1.2.1 As is well known, the volume of a substance reduces under 
increasing pressure. In Eder (1968), a representation of the relative decrease 
in volume y = AV/V, is to be found for some chemical elements for pressures 
up to 9.80665 x 10° Pa (0 < x < 9.80665 x 10°). The curves (Figure 1.2.1) 
for the elements caesium (Cs), barium (Ba), bismuth (Bi), and antimony (Sb) 
have points of discontinuity, which are called pressure fixed points and can be 
determined experimentally. 

This suggests using for bismuth, for instance, a regression function of the
form

$$f(x) = h_i(x, \alpha_i), \qquad \gamma_{i-1} < x \le \gamma_i, \quad i = 1, 2, \ldots, 5,$$

with $\gamma_0 = 0$, $\gamma_5 = 9.80665$.


Fig. 1.2.1. Compressibility of elements due to Eder (1968, p. 273)

Fig. 1.2.2. Temperature-quantity of heat diagram of water

Example 1.2.2 The temperature-quantity of heat diagram of water shows the
behaviour represented in Figure 1.2.2, where $y$ denotes the temperature (°C)
of the water (of the ice and the steam, respectively), and $x$ the heat absorbed
(J). Then the relation between $y$ and $x$ is described by a regression function

$$f(x) = \begin{cases} m_1 + n_1 x, & 0 \le x \le \gamma_1, \\ c_1, & \gamma_1 \le x \le \gamma_2, \\ m_3 + n_3 x, & \gamma_2 \le x \le \gamma_3, \\ c_2, & \gamma_3 \le x \le \gamma_4, \\ m_5 + n_5 x, & \gamma_4 \le x, \end{cases}$$

and because of continuity we demand

$$m_1 + n_1 \gamma_1 = c_1 = m_3 + n_3 \gamma_2,$$
$$m_3 + n_3 \gamma_3 = c_2 = m_5 + n_5 \gamma_4.$$

Hence we speak of continuous changes of states.
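The continuity conditions of Example 1.2.2 leave only the first intercept, the three slopes and the four change points free; the plateau levels and the remaining intercepts follow from them. The sketch below illustrates this for a five-state curve of that shape. The symbol names (`m1`, `n1`, `n3`, `n5`, `g1` to `g4`) and all numbers are hypothetical placeholders chosen for easy arithmetic, not physical constants of water.

```python
def continuous_five_state_curve(m1, n1, n3, n5, g1, g2, g3, g4):
    """Build f(x) with states line, plateau, line, plateau, line.

    The plateau levels c1, c2 and the intercepts m3, m5 are derived from
    the continuity conditions at the change points g1 < g2 < g3 < g4.
    """
    c1 = m1 + n1 * g1      # first line meets first plateau at g1
    m3 = c1 - n3 * g2      # second line leaves first plateau at g2
    c2 = m3 + n3 * g3      # second line meets second plateau at g3
    m5 = c2 - n5 * g4      # third line leaves second plateau at g4

    def f(x):
        if x <= g1:
            return m1 + n1 * x
        if x <= g2:
            return c1
        if x <= g3:
            return m3 + n3 * x
        if x <= g4:
            return c2
        return m5 + n5 * x

    return f


# Illustrative numbers only: start at -20, plateaus at 0 and 100.
f = continuous_five_state_curve(m1=-20.0, n1=0.5, n3=0.25, n5=0.5,
                                g1=40.0, g2=120.0, g3=520.0, g4=1500.0)
print([f(x) for x in (0.0, 40.0, 120.0, 520.0, 1500.0, 1600.0)])
```

Because the derived quantities are computed from the free ones, any parameter estimate expressed in this form satisfies the continuity restrictions automatically.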


Example 1.2.3 In semiconductor physics it is well known that the conductivity
$\kappa$ of a semiconductor having imperfections changes strongly in dependence
on the temperature $t$. The behaviour of $\ln \kappa$ depending on $x = 1/t$ is
described according to Paul (1974) by Figure 1.2.3 and consequently by a
regression function of the form

$$f(x) = h_i(x, \alpha_i), \qquad \gamma_{i-1} < x \le \gamma_i, \quad i = 1, 2, 3,$$

with $\gamma_0 = 0$, where because of continuity

$$h_1(\gamma_1, \alpha_1) = h_2(\gamma_1, \alpha_2) \quad\text{and}\quad h_2(\gamma_2, \alpha_2) = h_3(\gamma_2, \alpha_3)$$

has to be satisfied.

Fig. 1.2.3. Curve of conductivity in dependence on the temperature $t$ (due to
Paul, 1974, p. 237, fig. 4.15)


In chemical engineering it is well known that the properties of objects are 
often subject to long-term changes. The reasons for this fact are uncontrollable 
disturbances such as changing activity of a catalyst, the ageing of the equip- 
ment, impurities of the raw materials, or atmospheric influences. Examples 
and some mathematical models applied in chemical engineering can be found 
for instance in Borodjuk and Lezki (1977). 

Further examples of models with changes of state from quality control, 
chemistry, biology, agriculture, water supply, astronomy, and economy are 
given, e.g., by Barnard (1959), Dunicz (1969), Sprent (1961), Gallant and Fuller 
(1973), Bacon and Watts (1971), Schulze (1977b), Poirier (1973), Fair and 
Jaffee (1972), and McGee and Carleton (1970). 
In order to cover the case of a change of the regression function in dependence
on outside conditions, we introduce the variable $z$, where we suppose
$z \in [a, b]$. In general, we consider (observation) models of the form

$$y_t = f(x_t, z_t) + \varepsilon_t \tag{1}$$

with

$$f(x_t, z_t) = g_\vartheta(x_t, z_t) = h_i(x_t, \alpha_i) \quad\text{if } \gamma_{i-1} < z_t \le \gamma_i,^{*} \quad i = 1, \ldots, r,$$

$$\gamma_0 = a, \quad \gamma_r = b, \quad \vartheta = (\alpha_1', \ldots, \alpha_r', \gamma_1, \ldots, \gamma_{r-1})'$$

and independent random observation errors $\varepsilon_t$ with the expectation $E\varepsilon_t = 0$
for all $t$ and variances

$$D\varepsilon_t = \sigma_t^2 = \sigma_i^2 \quad\text{if } \gamma_{i-1} < z_t \le \gamma_i, \quad i = 1, \ldots, r.$$


Depending on the variable $z$, the model changes from a state $i$ characterized
by the state function $h_i(\cdot,\cdot)$ and the state parameters $(\alpha_i, \sigma_i^2)$ to a state $i + 1$
characterized by the state function $h_{i+1}(\cdot,\cdot)$ and the state parameters $(\alpha_{i+1}, \sigma_{i+1}^2)$.
The change takes place at the change point $z = \gamma_i$. The variable $z$ causing the
change of state affects the regression function and the distribution of the
observation errors. The distribution of the errors is arbitrary and $z$ affects the
variance in the manner mentioned above. In particular, $z_t = t$ can denote the
time; then the change of state depends on the time.

But also $z_t = x_t$ or $z_t = k(x_t)$ can hold, and the change then occurs in dependence
on the variable $x$, as in Examples 1.2.1 to 1.2.3. The literature often treats
models with $z = x \in \mathbb{R}^1$ and the condition

$$h_i(\gamma_i, \alpha_i) = h_{i+1}(\gamma_i, \alpha_{i+1}), \qquad i = 1, \ldots, r - 1,$$

which guarantees that the state functions are continuously connected (Examples
1.2.2 and 1.2.3). We call these models models with continuous changes of
state.

The state functions $h_i(x_t, \alpha_i)$ are assumed to be known up to their state
parameter $\alpha_i$, where the $\alpha_i$ may be linear or nonlinear parameters. The state
parameters $(\alpha_i, \sigma_i^2)$, $i = 1, \ldots, r$, and the change points $\gamma_1, \ldots, \gamma_{r-1}$ are unknown.
Except in Section 1.2.6 we assume the state number $r$ to be known.

We aim at presenting a survey of methods of estimating the state parameters,
the change points $\gamma_1, \ldots, \gamma_{r-1}$ and derived parameters, respectively,
and of tests to check the stability of the regression function. Many papers
dealing with these problems have been published: Quandt (1958, 1960, 1972),
Sprent (1961), Robison (1964), Hudson (1966), Hinkley (1969, 1971), Farley
and Hinich (1970), Goldfeld and Quandt (1972, 1973), Gallant and Fuller (1973),
Poirier (1973), Brown, Durbin and Evans (1975), Feder (1975a, b), Schulze
(1977a, b). Nevertheless, we must note critically that most of the methods
are based on heuristic principles and many theoretical questions are still
open. This is especially true for distribution statements on tests for checking
the state stability, and for investigations and the comparison of the power of
the different tests. In applying the least squares method, practicable algorithms
for determining the estimates are still missing for many models, so that
we often have to use approximate estimates.

* $\gamma_0 < z_t$ is always read as $\gamma_0 \le z_t$.

This section is mainly a review, with some complements from the
theoretical point of view, of the methods applied in models with changes of
state, where several open questions are referred to. First we are going to
consider models with known $z_1, \ldots, z_n$. We treat the weighted least squares
estimate of the state parameters $\alpha_i$, $i = 1, \ldots, r$, and of the change points
$\gamma_1, \ldots, \gamma_{r-1}$ and inquire into some test problems occurring in models with changes
of state. While Section 1.2.2 deals with the general model (1), Section 1.2.3
is entirely dedicated to models with continuous changes of state. In Section
1.2.4 some asymptotic results are summarized. In particular we discuss sufficient
conditions for the consistency of the weighted least squares estimates
in the adequate as well as in the inadequate model and their relation to the
conditions described in Section 1.1.5.

In all sections special emphasis is placed on models with linear state functions
$h_i(x, \alpha_i) = \alpha_i' h_i(x)$, $i = 1, \ldots, r$, and on models with linear state functions
which differ only in their parameters $\alpha_i$:

$$h_i(x, \alpha_i) = \alpha_i' h(x), \qquad i = 1, \ldots, r.$$

In Section 1.2.5 we consider briefly some more special models with changes of
state treated in the literature. Among them are models with random changes
of state. Finally, in Section 1.2.6 we refer to some methods to identify changes
of state in models with an unknown state number $r$.

Let us introduce some general assumptions and notation: 


x,€ X CRY, 2, € [a,b] =:Z—R! 


a; € A; CR", m= lb ope 


hj(-,-): & XR? > R}, eae eae ae 

yi € [c, d] — [a, b] =: Z, Diels Ta Oy Ge aaa b; 
Vi, os 4+) Prat) Cl = (an --o Yr-a) (Mi [dh via V 
E=2 5.0.7 —1}. 


The vector $\gamma$ of the change points we will also call the change point.
If $\xi = ((x_1, z_1), \ldots, (x_n, z_n))$ denotes an experimental design with the points
$(x_t, z_t)$, $t = 1, \ldots, n$, then $\Gamma^\xi$ denotes the set of all admissible change points
$\gamma = (\gamma_1, \ldots, \gamma_{r-1})'$ satisfying the condition that there exists at least one
observation for each state, i.e., for each $i = 1, \ldots, r$ there is a $t \in \{1, \ldots, n\}$ with

$$\gamma_{i-1} < z_t \le \gamma_i \qquad (\gamma_0 := a,\ \gamma_r := b).$$


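Then

$$\vartheta := (\alpha', \gamma')' \in A \times \Gamma^\xi =: \Theta^\xi$$

and

$$g_\vartheta(x, z) = \sum_{i=1}^{r} h_i(x, \alpha_i)\, 1_{(\gamma_{i-1}, \gamma_i]}(z).$$

With regard to the variances we assume $\sigma = (\sigma_1, \ldots, \sigma_r)' \in (\mathbb{R}^+)^r$. Moreover,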
WO MSS SY. 0i0s Ya eg Eat = (Sp sins FER) 


1.2.2 Ordered models with abrupt state switching 
1.2.2.1 The model 


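Let us assume there are independent observations $y_1, \ldots, y_n$ for (1) according
to an experimental design $\xi = ((x_1, z_1), \ldots, (x_n, z_n))$ satisfying without loss
of generality the condition $z_1 \le z_2 \le \ldots \le z_n$. In this case we use the term
'ordered model'. Let $\gamma = (\gamma_1, \ldots, \gamma_{r-1})' \in \Gamma^\xi$. Then there exist natural numbers
$m_1 = m(\gamma_1) < \ldots < m_{r-1} = m(\gamma_{r-1})$ such that

$$\begin{aligned}
y_t &= h_1(x_t, \alpha_1) + \varepsilon_t \quad\text{with } D\varepsilon_t = \sigma_1^2 \quad\text{for } t = 1, \ldots, m_1, \\
y_t &= h_2(x_t, \alpha_2) + \varepsilon_t \quad\text{with } D\varepsilon_t = \sigma_2^2 \quad\text{for } t = m_1 + 1, \ldots, m_2, \\
&\;\;\vdots \\
y_t &= h_r(x_t, \alpha_r) + \varepsilon_t \quad\text{with } D\varepsilon_t = \sigma_r^2 \quad\text{for } t = m_{r-1} + 1, \ldots, n
\end{aligned} \tag{2}$$

and

$$E\varepsilon_t = 0, \qquad t = 1, \ldots, n.$$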


The change of state always takes place between the points $(x_{m_i}, z_{m_i})$ and
$(x_{m_i+1}, z_{m_i+1})$, where $m_i$ is the number of the last observation of the $i$th state.
Thus, the $m_i$, $i = 1, \ldots, r - 1$, are defined by

$$m_i = m(\gamma_i) = \max\{t \mid z_t \le \gamma_i\} \tag{3}$$

and they are called change indices. The vector $m = m(\gamma) = (m(\gamma_1), \ldots, m(\gamma_{r-1}))'$
we will also call the change index.

Obviously the change index $m(\gamma)$ depends on the chosen experimental
design $\xi$. For a fixed experimental design, each change point $\gamma \in \Gamma^\xi$ uniquely
determines a change index $m(\gamma)$ given by (3). Hence we can define the set of
admissible change indices

$$\mathcal{M}^\xi = m(\Gamma^\xi) = \{m(\gamma) \mid \gamma \in \Gamma^\xi\}.$$

All $\gamma = (\gamma_1, \ldots, \gamma_{r-1})' \in \Gamma^\xi$ with $z_{m_i} \le \gamma_i < z_{m_i+1}$, $i = 1, \ldots, r - 1$, have the
same change index $m = (m_1, \ldots, m_{r-1})'$. Generally, for a given experimental
design $\xi$, the parameter $\gamma \in \Gamma^\xi$ will not be uniquely defined by the expectations
$f(x_t, z_t)$, $t = 1, \ldots, n$. But often at least the change index $m = m(\gamma)$ is uniquely
determined by the experimental design, as the following example will show.


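Example 1.2.4 Let

$$f(x) = \begin{cases} a_1 + b_1 x, & x \le \gamma, \\ a_2 + b_2 x, & \gamma < x, \end{cases} \qquad \gamma \in [1, 3/2],$$

and $\xi = \left(\frac{1}{10}, \frac{2}{10}, \ldots, \frac{20}{10}\right)$. Then it follows that $\Gamma^\xi = [1, 3/2]$,
$\mathcal{M}^\xi = \{10, 11, \ldots, 15\}$, and $\gamma$ is not uniquely identifiable from $f(x_t)$, $t = 1, \ldots, n$.
But the change index $m = m(\gamma) \in \mathcal{M}^\xi$ is uniquely determined provided that at least
one change of state takes place, i.e. $(a_1, b_1) \ne (a_2, b_2)$ is valid.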
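The map from a change point to its change index, as defined in (3), can be sketched with a binary search on the ordered design. The snippet below uses the design of Example 1.2.4 with exact fractions and scans a grid of candidate change points on $[1, 3/2]$; the function name `change_index` and the grid spacing are illustrative assumptions.

```python
from bisect import bisect_right
from fractions import Fraction


def change_index(z, gamma):
    """Return m(gamma) = max{t | z_t <= gamma}, counting t from 1.

    z must be sorted in ascending order; the result is 0 if no z_t <= gamma.
    """
    return bisect_right(z, gamma)


# Design of Example 1.2.4: z_t = x_t = t/10, t = 1, ..., 20.
z = [Fraction(t, 10) for t in range(1, 21)]

# Hypothetical grid of candidate change points on [1, 3/2], step 1/100.
gammas = [Fraction(100 + j, 100) for j in range(51)]
admissible = sorted({change_index(z, g) for g in gammas})
print(admissible)  # [10, 11, 12, 13, 14, 15]

# Different change points between two design points share one change index.
print(change_index(z, Fraction(105, 100)), change_index(z, Fraction(109, 100)))  # 10 10
```

The second print illustrates why the change point itself is not identifiable from the design, while the change index is.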


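Obviously, the vector of the values of the regression function on the experimental
design $\xi$ and the covariance matrix of the vector of all observations
$y = (y_1, \ldots, y_n)'$ depend on the parameter $\gamma$ only via the corresponding change
index $m = m(\gamma)$. So we can write

$$f(x_t, z_t) = g_\vartheta(x_t, z_t) = g_{(\alpha,\gamma)}(x_t, z_t) =: g_{(\alpha, m(\gamma))}(x_t, z_t), \qquad t = 1, \ldots, n,$$

and

$$\Sigma(\gamma) = \Sigma(m(\gamma)) \in \mathcal{V}_{m(\gamma)},$$

where

$$\mathcal{V}_m = \{\Sigma(m) = \mathrm{Diag}[\sigma_1^2 I_{m_1}, \sigma_2^2 I_{m_2 - m_1}, \ldots, \sigma_r^2 I_{n - m_{r-1}}] \mid \sigma = (\sigma_1, \ldots, \sigma_r)' \in (\mathbb{R}^+)^r\}.$$

Under the assumptions made so far we have

$$y_t = g_{(\alpha,m)}(x_t, z_t) + \varepsilon_t, \quad t = 1, \ldots, n, \qquad E\varepsilon = 0, \quad D\varepsilon = \Sigma(m) \tag{4}$$

with $\alpha \in A$, $\Sigma(m) \in \mathcal{V}_m$, $m \in \mathcal{M}^\xi$.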
In the case of linear state functions $h_i(x, \alpha_i) = \alpha_i' h_i(x)$, we introduce the
$r$-block matrix

$$H(m) = \mathrm{Diag}[H_1(m), H_2(m), \ldots, H_r(m)]$$

with $m = m(\gamma)$, where the $i$th matrix $H_i(m)$ denotes the design matrix with
the rows $h_i'(x_t)$, $t = m_{i-1} + 1, \ldots, m_i$ ($m_0 := m(\gamma_0) = m(a) = 0$, $m_r := m(\gamma_r)
= m(b) = n$). The model (4) has the representation

$$y = H(m)\,\alpha + \varepsilon, \qquad E\varepsilon = 0, \quad D\varepsilon = \Sigma(m), \qquad \alpha \in A, \quad \Sigma(m) \in \mathcal{V}_m, \quad m \in \mathcal{M}^\xi. \tag{5}$$

For linear state functions $\alpha_i' h_i(x)$ we always suppose (if no other assumption is
made) that $\alpha_i \in A_i = \mathbb{R}^{p_i}$ and $\alpha \in A = \mathbb{R}^p$, respectively.

If the change index $m$ is known, (5) describes a linear model. But if $m$ is
unknown, it is clearly a nonlinear model. The demand that, given an experimental
design $\xi$, for each $m \in \mathcal{M}^\xi$ the state parameters $\alpha_1, \ldots, \alpha_r$ are uniquely
determined by $f(x_t, z_t) = g_\vartheta(x_t, z_t)$ is equivalent to

$$r[H(m)] = p \quad\text{for all } m \in \mathcal{M}^\xi; \tag{6}$$

hence it is required that $\alpha$ is identifiable in each model with fixed $m$. This is
equivalent to the existence of an unbiased linear estimate of $\alpha$ in the model

$$y = H(m)\,\alpha + \varepsilon, \qquad E\varepsilon = 0, \quad D\varepsilon = \Sigma(m), \qquad \alpha \in \mathbb{R}^p, \quad \Sigma(m) \in \mathcal{V}_m \tag{7}$$

with fixed $m \in \mathcal{M}^\xi$.
In the model from Example 1.2.4 the condition (6) trivially holds.


1.2.2.2 Least squares estimators 


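A weighted least squares estimator (WLSE) $\hat\vartheta = (\hat\alpha', \hat\gamma')'$ is defined to be any
solution of the minimization problem

$$\|y - g_{\hat\vartheta}\|_w^2 = \min_{\vartheta \in \Theta^\xi} \|y - g_\vartheta\|_w^2$$

with

$$\|y - g_\vartheta\|_w^2 = n^{-1} \sum_{t=1}^{n} w_t \bigl(y_t - g_\vartheta(x_t, z_t)\bigr)^2$$

(cf. Definition 1.1.1).

Concerning the weighting $w_t = w(z_t)$ we assume that it depends on the change
point $\gamma$ only via the induced change index $m(\gamma)$. This is fulfilled in the case of
a GLSE ($w_t = \sigma_i^{-2}$ for $m(\gamma_{i-1}^0) < t \le m(\gamma_i^0)$, $i = 1, \ldots, r$) and an OLSE ($w_t = 1$).
Then also

$$\|y - g_\vartheta\|_w^2 = n^{-1} \sum_{i=1}^{r} \sum_{t=m_{i-1}+1}^{m_i} w_t \bigl(y_t - h_i(x_t, \alpha_i)\bigr)^2 \qquad (m_0 := 0,\ m_r := n)$$

depends on $\gamma$ only via the corresponding change index $m = m(\gamma)$, and the
WLSE $\hat\vartheta$ is not uniquely determined. Let $\hat m = (\hat m_1, \ldots, \hat m_{r-1})'$ be a WLSE of
$m = m(\gamma)$; then each $\hat\gamma = (\hat\gamma_1, \ldots, \hat\gamma_{r-1})' \in \Gamma^\xi$ with $m(\hat\gamma) = \hat m$ is a WLSE of $\gamma$.
The condition $m(\hat\gamma) = \hat m$ is equivalent to $\hat\gamma_i \in [z_{\hat m_i}, z_{\hat m_i + 1})$, $i = 1, \ldots, r - 1$.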
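A WLSE $\hat\vartheta = (\hat\alpha', \hat\gamma')'$ can be calculated in two steps. First, for each fixed
$m \in \mathcal{M}^\xi$ the WLSE $\hat\alpha_m := (\hat\alpha_1'(m), \ldots, \hat\alpha_r'(m))'$ as a solution of the minimization
problems

$$\sum_{t=m_{i-1}+1}^{m_i} w_t \bigl(y_t - h_i(x_t, \hat\alpha_i(m))\bigr)^2 = \min_{\alpha_i \in A_i} \sum_{t=m_{i-1}+1}^{m_i} w_t \bigl(y_t - h_i(x_t, \alpha_i)\bigr)^2 =: S_i(m), \qquad i = 1, \ldots, r,$$

and the corresponding residual sum $S(m) := \sum_{i=1}^{r} S_i(m) = n\,\|y - g_{(\hat\alpha_m, m)}\|_w^2$ are calculated.
Afterwards an $\hat m \in \mathcal{M}^\xi$ is determined providing a minimal residual sum
with respect to $m \in \mathcal{M}^\xi$. Then $(\hat\alpha_{\hat m}', \hat m')'$ and $(\hat\alpha_{\hat m}', \hat\gamma')'$, $\hat\gamma = (\hat\gamma_1, \ldots, \hat\gamma_{r-1})'$ with $\hat\gamma_i
\in [z_{\hat m_i}, z_{\hat m_i + 1})$, $i = 1, \ldots, r - 1$, are WLSEs for $(\alpha', m')'$ and $\vartheta = (\alpha', \gamma')'$,
respectively. A WLSE $\hat\vartheta$ exists if $\hat\alpha_m$ exists for all $m \in \mathcal{M}^\xi$. The existence of $\hat\alpha_m$
is for instance ensured if the state functions $h_i(x, \alpha_i)$ are continuous on $\mathcal{X} \times A_i$
and if the $A_i$ are compact sets.

Since the number of elements in $\mathcal{M}^\xi$ is mostly very large even in simple models,
the determination of $\hat m$ causes much computational effort. Moreover, in
the case of nonlinear state functions $h_i(x, \alpha_i)$ iterative techniques for determining
the $\hat\alpha_i(m)$ are used. In the case of linear state functions $h_i(x, \alpha_i) = \alpha_i' h_i(x)$
with $\alpha_i \in A_i = \mathbb{R}^{p_i}$, $\hat\alpha_m = (\hat\alpha_1'(m), \ldots, \hat\alpha_r'(m))'$ exists for all $m \in \mathcal{M}^\xi$ and hence
a WLSE $\hat\vartheta$ exists; one solution is

$$\hat\alpha_i(m) = [H_i'(m) W_i(m) H_i(m)]^{-} H_i'(m) W_i(m)\, y_i(m), \qquad i = 1, \ldots, r,$$

with a generalized inverse $[\cdot]^{-}$,

$$y_i(m) = (y_{m_{i-1}+1}, \ldots, y_{m_i})' \quad\text{and}\quad W_i(m) = \mathrm{Diag}[w_{m_{i-1}+1}, \ldots, w_{m_i}].$$

In the case $r[H_i(m)] = p_i$, $i = 1, \ldots, r$, and $w_t > 0$ for all $t = 1, \ldots, n$, $\hat\alpha_m$
is an unbiased estimator of $\alpha$ in the model (7) with fixed change index $m$.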
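The two-step calculation can be sketched for the simplest case: $r = 2$ states, unit weights (OLSE), and straight-line state functions with $h(x) = (1, x)'$. This is a brute-force illustration under those assumptions, not one of the more efficient algorithms cited in the text. The function names, the requirement of at least two observations per state (so that each segment has full rank), and the noise-free data are all hypothetical.

```python
def line_rss(xs, ys):
    """Residual sum of squares S_i(m) of the OLS line fitted to one segment."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))


def two_step_olse(xs, ys, min_len=2):
    """Step 1: S(m) for every candidate m; step 2: return the minimizing (m, S(m))."""
    n = len(xs)
    best = None
    for m in range(min_len, n - min_len + 1):
        s = line_rss(xs[:m], ys[:m]) + line_rss(xs[m:], ys[m:])
        if best is None or s < best[1]:
            best = (m, s)
    return best


# Noise-free illustration: y = x for t <= 6 and y = 20 - x afterwards.
xs = [float(t) for t in range(1, 13)]
ys = [x if x <= 6 else 20.0 - x for x in xs]
print(two_step_olse(xs, ys))  # (6, 0.0)
```

Any change point in the interval between the sixth and the seventh design point then yields the same fit, in line with the non-uniqueness noted above.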

The computational effort for determining the $\hat\alpha_i(m)$ and $\hat m$ can be reduced
by using updating techniques. For models with linear state functions $\alpha_i' h(x)$,
$i = 1, \ldots, r$, Guthery (1974), starting from Bellman's optimization principle
and using the updating technique, suggested a dynamic program to compute an
OLSE $\hat\vartheta$ which substantially reduces the computational effort.

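For applications and for asymptotic investigations it is useful to assume lower
bounds for the state lengths $\gamma_i - \gamma_{i-1}$, i.e. to demand that $\gamma_i - \gamma_{i-1} \ge \delta_i > 0$
for given bounds $\delta_i$, $i = 1, \ldots, r$, and to introduce the corresponding parameter
spaces

$$\Gamma_\delta := \{\gamma \in \Gamma \mid \gamma_i - \gamma_{i-1} \ge \delta_i,\ i = 1, \ldots, r\},$$
$$\Gamma_\delta^\xi := \{\gamma \in \Gamma^\xi \mid \gamma_i - \gamma_{i-1} \ge \delta_i,\ i = 1, \ldots, r\},$$
$$\Theta_\delta := A \times \Gamma_\delta, \qquad \Theta_\delta^\xi := A \times \Gamma_\delta^\xi \subset A \times \Gamma^\xi =: \Theta^\xi,$$
$$\mathcal{M}_\delta^\xi := \{m(\gamma) \mid \gamma \in \Gamma_\delta^\xi\} \subset \mathcal{M}^\xi \quad\text{with } \delta := (\delta_1, \ldots, \delta_r)'.$$

For $\delta = 0$, $\Theta_0^\xi = \Theta^\xi$ and $\mathcal{M}_0^\xi = \mathcal{M}^\xi$.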

The WLSE with respect to $\Theta_\delta^\xi$ can also be calculated in two steps. It has
properties similar to those of the WLSE with respect to $\Theta^\xi$. We will denote
it by $\hat\vartheta^\delta$. Formally, $\hat\vartheta = \hat\vartheta^0$ holds. If

$$\max_{1 \le i \le r} \delta_i < \min_{2 \le t \le n} \{|z_t - z_{t-1}| \mid z_t \ne z_{t-1}\}$$

is valid, it obviously follows that $\Theta^\xi = \Theta_\delta^\xi$ and consequently $\hat\vartheta = \hat\vartheta^\delta$.

Since we do not know the true state lengths $\gamma_i^0 - \gamma_{i-1}^0$ ($i = 1, \ldots, r$), we
want to choose the $\delta_i$, $i = 1, \ldots, r$, as small as possible. On the other hand,
large $\delta_i$ may be of advantage for the numerical determination of $\hat\vartheta^\delta$. If one $\delta_i$
is chosen too large, i.e. if $\gamma_i^0 - \gamma_{i-1}^0 < \delta_i$ for an $i \in \{1, \ldots, r\}$, then the true
parameter $\gamma^0$ does not belong to the parameter space $\Gamma_\delta^\xi$, and the model becomes
inadequate.

In Section 1.2.4 we shall refer to sufficient conditions for the consistency of 


and } (Theorems 1.2.1, 1.2.2 and Remarks 1.2.3 and 1.2.4). 


1.2.2.3 Testing the presence of a state switching 


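First we want to compare the states in two groups $A$ and $B$ of observations,
where we assume for simplicity that

$$y_t = \alpha'(A)\,h(x_t) + \varepsilon_t, \quad t \in A, \quad \alpha(A) \in \mathbb{R}^{p_1},$$
$$y_t = \alpha'(B)\,h(x_t) + \varepsilon_t, \quad t \in B, \quad \alpha(B) \in \mathbb{R}^{p_1},$$

with independently $N(0, \sigma^2)$ distributed $\varepsilon_t$. Hence, the regression functions
(state functions) of the groups $A$ and $B$ have the same form and differ at most
in their (state) parameters $\alpha(A)$ and $\alpha(B)$. These assumptions lead to the linear
model

$$y = \begin{pmatrix} y_A \\ y_B \end{pmatrix} = \begin{pmatrix} H(A) & 0 \\ 0 & H(B) \end{pmatrix} \begin{pmatrix} \alpha(A) \\ \alpha(B) \end{pmatrix} + \varepsilon \tag{8}$$

with $\varepsilon \sim N(0, \sigma^2 I_n)$, $n = n(A) + n(B)$.

For any set $\Delta$ let $n(\Delta)$ denote the sample size of the group $\Delta$, $y_\Delta$ the vector of
observations $y_t$ from $\Delta$, and $H(\Delta)$ the design matrix of the group $\Delta$ with the
rows $h'(x_t)$, $t \in \Delta$.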

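A change of state occurs iff $\alpha(A) \ne \alpha(B)$, and the test problem concerned in
the model (8) is

$$H\colon \alpha(A) = \alpha(B) \quad\text{against}\quad K\colon \alpha(A) \ne \alpha(B)$$

or equivalently, using $\alpha = \begin{pmatrix} \alpha(A) \\ \alpha(B) \end{pmatrix}$:

$$H\colon (I_{p_1} \,\vdots\, -I_{p_1})\,\alpha = 0 \quad\text{against}\quad K\colon (I_{p_1} \,\vdots\, -I_{p_1})\,\alpha \ne 0. \tag{9}$$

This hypothesis is testable iff

$$p_1 = r[H(A)] = r[H(B)]. \tag{10}$$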
The theory of testing linear hypotheses (cf. Bunke and Bunke, 1986) provides
the test statistic

$$F(A, B) = \frac{n(A) + n(B) - 2p_1}{p_1} \cdot \frac{\|H(A)\hat\alpha(A) - H(A)\hat\alpha(A \cup B)\|^2 + \|H(B)\hat\alpha(B) - H(B)\hat\alpha(A \cup B)\|^2}{\|y_A - H(A)\hat\alpha(A)\|^2 + \|y_B - H(B)\hat\alpha(B)\|^2} \underset{H}{\sim} F_{p_1,\, n(A)+n(B)-2p_1}, \tag{11}$$

where $\hat\alpha(A)$, $\hat\alpha(B)$ and $\hat\alpha(A \cup B)$ denote the OLSE based on the observations of
the group $A$, the group $B$ and the union $A \cup B$, respectively. $\|z\|$ is defined as the Euclidean norm
$\|z\| = (z'z)^{1/2}$.

The notation '$\underset{H}{\sim}$' used in (11) means that under the hypothesis H the considered
statistic has the given distribution. The hypothesis H is rejected if
$F(A, B) \ge F_{p_1,\, n(A)+n(B)-2p_1;\,\alpha}$ is valid. The $\alpha$-test constructed from (11) is
equivalent to the likelihood ratio test. Under the assumption (10) the statistic
testing (9), which was given in Bunke and Bunke (1986, p. 269), coincides with
the statistic $F(A, B)$ in (11). If the condition (10) is fulfilled for all sufficiently
large sample sizes $n(A)$ and $n(B)$, and if

$$\lambda_{\min}[H'(A) H(A)] \xrightarrow[n(A) \to \infty]{} \infty \quad\text{and}\quad \lambda_{\min}[H'(B) H(B)] \xrightarrow[n(B) \to \infty]{} \infty,$$

then according to Bunke and Bunke (1986, Theorem 5.2.4), the sequence of
tests based on (11) is consistent. Chow (1960) and Quandt (1960) suggested
further tests for checking (15) based on intuitive principles.
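To make the structure of the statistic (11) concrete, the following sketch evaluates it for two groups with the straight-line state function $h(x) = (1, x)'$, so that $p_1 = 2$. The numerator compares the separate fits with the pooled fit on $A \cup B$; the denominator is the residual sum of squares of the separate fits. The function names and the data are hypothetical; the quantile of the F-distribution needed for the decision is not computed here.

```python
def fit_line(xs, ys):
    """OLSE (intercept, slope) for the state function h(x) = (1, x)'."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def two_group_f(xa, ya, xb, yb, p1=2):
    """Statistic F(A, B) of (11) for two groups sharing h(x) = (1, x)'."""
    coef_a = fit_line(xa, ya)
    coef_b = fit_line(xb, yb)
    coef_ab = fit_line(xa + xb, ya + yb)   # pooled fit on the union of A and B

    def fitted(coef, x):
        return coef[0] + coef[1] * x

    numerator = (sum((fitted(coef_a, x) - fitted(coef_ab, x)) ** 2 for x in xa)
                 + sum((fitted(coef_b, x) - fitted(coef_ab, x)) ** 2 for x in xb))
    denominator = (sum((y - fitted(coef_a, x)) ** 2 for x, y in zip(xa, ya))
                   + sum((y - fitted(coef_b, x)) ** 2 for x, y in zip(xb, yb)))
    df2 = len(xa) + len(xb) - 2 * p1
    return df2 / p1 * numerator / denominator


# Hypothetical data: rising trend in group A, falling trend in group B.
xa, ya = [1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.9, 5.1]
xb, yb = [6.0, 7.0, 8.0, 9.0, 10.0], [8.0, 7.1, 5.9, 5.2, 3.8]
print(two_group_f(xa, ya, xb, yb))
```

By orthogonality the numerator equals the pooled residual sum of squares minus the sum of the two separate residual sums of squares, which gives an equivalent way of computing the same value.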


Remark 1.2.1 Here we admit different variances $\sigma^2(A)$ and $\sigma^2(B)$ for the
observations in $A$ and $B$, respectively. If we consider the test problem $H\colon \alpha(A)
= \alpha(B)$ against $K\colon \alpha(A) \ne \alpha(B)$, i.e. if we are only interested in the change of
the parameter of the state function, then we have a generalized Behrens-Fisher
problem. For the special case $h(x) = 1$ it is identical with the Behrens-Fisher
problem. Toyoda (1974) and Schmidt and Sickles (1977) investigated
the distribution of the statistic $F(A, B)$ in (11) under $H\colon \alpha(A) = \alpha(B)$. A
compact representation of this distribution was not found. Schmidt and Sickles
derived an exact formula for the probability $P\{F(A, B) \ge f\}$, which depends
in a complicated manner on the ratio $\sigma^2(B)/\sigma^2(A)$. In particular, they showed that

$$P\{F(A, B) \ge F_{p_1,\, n(A)+n(B)-2p_1;\,\alpha}\} < \alpha \tag{12}$$

may hold for $\alpha(A) = \alpha(B)$ and $\sigma^2(A) \ne \sigma^2(B)$ and thus the level of significance
$\alpha$ is not kept exactly.

Jayatissa (1977) constructed a statistic having an F-distribution under
$\alpha(A) = \alpha(B)$ and $\sigma^2(A) \ne \sigma^2(B)$ so that the test based on this statistic exactly
keeps the level of significance. On the other hand, we have a change of state
in the model (8) with variances $\sigma^2(A)$ and $\sigma^2(B)$ iff $\alpha(A) \ne \alpha(B)$ or $\sigma^2(A) \ne \sigma^2(B)$
holds, hence it might be useful to consider the 'extended' test problem

$$H\colon \alpha(A) = \alpha(B),\ \sigma^2(A) = \sigma^2(B) \quad\text{against}\quad K\colon \alpha(A) \ne \alpha(B) \text{ or } \sigma^2(A) \ne \sigma^2(B). \tag{13}$$

Under the extended hypothesis, the test statistic $F(A, B)$ of (11) has the F-distribution
given above so that it provides an exact $\alpha$-test for checking (13).
Because of (12), however, it is not unbiased.

Now we return to our observation model (5), where we assume the $\varepsilon_t$ to be
normally distributed. We restrict ourselves to models with at most $r = 2$
linear state functions $h_i(x, \alpha_i) = \alpha_i' h(x)$, $i = 1, 2$. If the change index $m$ is
known, then the problem

H: There is no change of state during the observations $t = 1, \ldots, n$
against
$K_m$: The change of state takes place exactly after the $m$th observation

can be checked with the test statistic $F(A, B)$ of (11) by setting

$$A := \{1, \ldots, m\} \quad\text{and}\quad B := \{m + 1, \ldots, n\}. \tag{14}$$

The resulting test statistic we denote by $T_m$.
Usually, however, the change index $m$ is unknown and the test problem which is
of interest is

H: There is no change of state during the observations $t = 1, \ldots, n$
against
K: There exists an $m \in \mathcal{M}^\xi$ such that the change of state occurs exactly after the
$m$th observation. (15)

Then obviously $K = \bigcup_{m \in \mathcal{M}^\xi} K_m$ holds.
Under the assumption

$$r[H(m)] = 2p_1 \quad\text{for all } m \in \mathcal{M}^\xi \tag{16}$$

the likelihood ratio statistic $\lambda$ for testing (15) is

$$\lambda = \min_{m \in \mathcal{M}^\xi} \lambda_m = \lambda_{\hat m},$$

where $\lambda_m$ and $\hat m$ denote the likelihood ratio statistic for checking H against
$K_m$ and the MLE of $m$, respectively. The distribution of $\lambda_{\hat m}$ is unknown.
Empirical investigations by Quandt (1960) showed that $-2 \ln \lambda_{\hat m}$ does not
asymptotically have a central $\chi^2$-distribution for $n \to \infty$.
The application of the union-intersection principle proposed by Roy (1953)
provides a further possibility for testing (15). It leads to accepting the hypothesis
H iff all individual tests for checking H against $K_m$ accept H. Under the
assumption (16) this principle leads to the acceptance of the hypothesis H iff

$$\max_{m \in \mathcal{M}^\xi} T_m < F_{p_1,\, n(A)+n(B)-2p_1;\,\alpha}$$

holds, where the statistics $T_m$ are defined by (11) and (14). If $|\mathcal{M}^\xi|$ denotes the
number of elements in $\mathcal{M}^\xi$, then $|\mathcal{M}^\xi|\,\alpha$ is an upper bound for the level of significance
of this test.

Referring to the above property of the likelihood ratio statistic, Quandt
(1960) proposed to check the problem H against $K = \bigcup_{m \in \mathcal{M}^\xi} K_m$ with a test that
is used for testing H against $K_{m_0}$, where $m_0$ is to be chosen 'properly'. He
proposed to use $m_0 = \hat m$ ($\hat m$ is the MLE of $m$), or

$$m_0 = \begin{cases} \dfrac{n}{2} & \text{if } n \text{ is even}, \\[1ex] \dfrac{n - 1}{2} \text{ or } \dfrac{n + 1}{2} & \text{if } n \text{ is odd}. \end{cases}$$

Some considerations concerning the goodness of such tests are to be found in
Quandt (1960). There are no general theoretical statements on the power. But it
is clear that the power heavily depends on the distance of the used $m_0$ from
the true change index $m^0$.

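Other tests, which are independent of the choice of an $m_0$, can be obtained
by the following idea. The test problem (15) has the representation

$$H\colon y = H_n \alpha_1 + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n), \quad \alpha_1 \in \mathbb{R}^{p_1}, \quad \sigma \in \mathbb{R}^+$$

against

$$K\colon y = H_n \alpha_1 + Z\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n), \quad \alpha_1 \in \mathbb{R}^{p_1}, \quad \beta = (\alpha_2 - \alpha_1) \in \mathbb{R}^{p_1} \setminus \{0\}, \quad \sigma \in \mathbb{R}^+, \tag{17}$$

$$Z \in \mathcal{Z}(\mathcal{M}^\xi) := \left\{ \begin{pmatrix} 0 \\ H_2(m) \end{pmatrix} \,\middle|\, m \in \mathcal{M}^\xi \right\},$$

where $H_n$ denotes the design matrix under H with the rows $h'(x_t)$, $t = 1, \ldots, m,
m + 1, \ldots, n$, and $H_2(m)$ the design matrix with the rows $h'(x_t)$, $t = m + 1, \ldots, n$.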

We propose to check the problem (17) with the tests for model testing described,
e.g., in Ramsey (1969) and Thalheim (1977) (RESET: Regression
Specification Error Test, RASET: Rank Specification Error Test, KOMSET:
Kolmogorov's Specification Error Test, BAMSET: Bartlett's M Specification
Error Test). These tests are used to test regression specification errors
of the form described in the alternative K in (17), but where $Z \in \mathcal{M}_{n \times p_1} \setminus \{0_{n \times p_1}\}$
is assumed, i.e. these tests are used to check the hypothesis H in (17) against
a larger alternative than K in (17). Besides deviations from the model by changes
of state, this larger alternative also includes other forms of deviations
from H in (17), e.g. the lack of essential variables in a stable model. If those
tests lead to a rejection of the hypothesis H in (17), it can only be concluded
that we have a deviation from the model, but without additional information it
cannot be concluded that this deviation from the model consists in a change
of state. Since the mentioned tests are not specific for models with changes of
state, we forgo a detailed presentation. We only refer to Thalheim (1977),
where the necessary tables are also contained. The tests obtained in this way
are based on residuals that are, in a certain sense, optimal linear unbiased
(Thalheim).

Two further tests for checking the constancy of a state, which are based on
normed recursive residuals, were proposed by Brown, Durbin, and Evans
(1975). They started from the model

$$y_t = \alpha_t' h(x_t) + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma_t^2), \quad t = 1, \ldots, n, \qquad \alpha_t \in \mathbb{R}^p, \quad \sigma_t^2 \in \mathbb{R}^+,$$

which allows changes of state after each observation. They developed so-called
cumulative sum tests to check the hypothesis

$$H\colon \alpha_1 = \alpha_2 = \ldots = \alpha_n = \alpha, \qquad \sigma_1^2 = \sigma_2^2 = \ldots = \sigma_n^2 = \sigma^2.$$

These tests are based on the normed recursive residuals

$$v_t = \frac{y_t - (\hat\alpha(t - 1))' h(x_t)}{\sqrt{1 + (h(x_t))' [H_{t-1}' H_{t-1}]^{-1} h(x_t)}}, \qquad t = p + 1, \ldots, n,$$

which are independently $N(0, \sigma^2)$-distributed under the hypothesis. Here
$\hat\alpha(t - 1)$, $(\hat\alpha(t - 1))' h(x_t)$ and $H_{t-1}$ are the OLSE of the regression coefficient,
the prediction of $y_t$ from the first $t - 1$ observations and the design matrix
with the rows $(h(x_i))'$, $i = 1, \ldots, t - 1$. The matrices $H_{t-1}$ are assumed to be of
full rank.
Let

$$S(n) = \sum_{t=1}^{n} \bigl(y_t - (\hat\alpha(n))' h(x_t)\bigr)^2 = \sum_{t=p+1}^{n} v_t^2 \quad\text{and}\quad \hat\sigma^2(n) = (n - p)^{-1} S(n)$$

be a consistent estimate of $\sigma^2$. The simple cumulative sum test is based on the
random variables

$$W_r = \frac{1}{\hat\sigma(n)} \sum_{t=p+1}^{r} v_t, \qquad r = p + 1, \ldots, n,$$

and the quadratic cumulative sum test on the random variables

$$Q_r = \frac{1}{S(n)} \sum_{t=p+1}^{r} v_t^2, \qquad r = p + 1, \ldots, n.$$

The hypothesis H is accepted iff

$$|W_r| \le 2a(r - p)(n - p)^{-1/2} + a(n - p)^{1/2} \quad\text{for all } r = p + 1, \ldots, n$$

and

$$\left| Q_r - \frac{r - p}{n - p} \right| < c \quad\text{for all } r = p + 1, \ldots, n,$$

respectively, i.e. if all $W_r$ lie between the straight lines running through the points
$(p + 1, a\sqrt{n - p})$ and $(n, 3a\sqrt{n - p})$ and $(p + 1, -a\sqrt{n - p})$ and $(n, -3a\sqrt{n - p})$
(cf. Figure 1.2.4) and if all $Q_r$ lie between the lines running parallel to the
straight line $g(r) = \frac{r - p}{n - p}$ at the distance $c$, respectively (cf. Figure 1.2.5).

Fig. 1.2.4. Region of acceptance for the simple cumulative sum test

Fig. 1.2.5. Region of acceptance for the quadratic cumulative sum test

The parameters $a$ and $c$, respectively, have to be chosen in such a way that the
desired level of significance $\alpha$ is obtained approximately.
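The recursive residuals and the simple cumulative sum test can be sketched for the simplest case $h(x) = 1$ (a constant mean, $p = 1$). Then the OLSE from the first $t - 1$ observations is their mean and the normalizing term $h'[H_{t-1}'H_{t-1}]^{-1}h$ equals $1/(t - 1)$. The function names and both data series are hypothetical; the value $a = 0.948$ is the one the text tabulates for $\alpha = 0.05$, and the boundary is the inequality for $|W_r|$ as stated above.

```python
import math


def recursive_residuals_mean(y):
    """Normed recursive residuals v_t, t = 2, ..., n, for h(x) = 1 (p = 1)."""
    v = []
    running_sum = y[0]
    for t in range(2, len(y) + 1):             # t is the 1-based observation index
        mean_prev = running_sum / (t - 1)      # OLSE from the first t-1 observations
        v.append((y[t - 1] - mean_prev) / math.sqrt(1.0 + 1.0 / (t - 1)))
        running_sum += y[t - 1]
    return v


def cusum_accepts(y, a):
    """Simple cumulative sum test: True iff every |W_r| stays within the boundary."""
    n, p = len(y), 1
    v = recursive_residuals_mean(y)
    sigma_hat = math.sqrt(sum(vt * vt for vt in v) / (n - p))
    w = 0.0
    for r in range(p + 1, n + 1):
        w += v[r - p - 1] / sigma_hat
        bound = 2 * a * (r - p) / math.sqrt(n - p) + a * math.sqrt(n - p)
        if abs(w) > bound:
            return False
    return True


stable = [0.1, -0.1] * 20                       # constant mean throughout
shifted = [0.1, -0.1] * 5 + [5.1, 4.9] * 15     # mean jumps after observation 10
print(cusum_accepts(stable, 0.948), cusum_accepts(shifted, 0.948))
```

For a general $h(x)$ with $p > 1$ the same loop applies once the running mean is replaced by a recursively updated least squares fit.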
For the following $\alpha$-values the $a$-values were given as follows:

    α | 0.01   0.05   0.10
    a | 1.143  0.948  0.850

To determine $c = c(\alpha, n - p)$ an approximate method was proposed. There
have not yet been any theoretical statements on the power of these cumulative
sum tests. A simulation study by Garbade (1977) for the case $p = 1$ seems to
point to the quadratic cumulative sum test being better than the simple one.

Hackl (1980) proposes a modification of the two cumulative sum tests. Instead of $W_r$ and $Q_r$ he starts from moving sums of recursive residuals (MOSUMs)

$W_{r,k} = \dfrac{1}{\hat\sigma(n)} \sum_{t=r-k+1}^{r} v_t, \quad r = p+k, \ldots, n,$

and

$Q_{r,k} = \dfrac{1}{\hat\sigma_{r,k}^2} \sum_{t=r-k+1}^{r} v_t^2, \quad r = p+k, \ldots, n,$

with the variance estimator $\hat\sigma_{r,k}^2 = (n-p-k)^{-1}\left(\sum_{t=p+1}^{r-k} v_t^2 + \sum_{t=r+1}^{n} v_t^2\right)$, where $k$ is a given natural number, $0 < k < n-p$. Here for each time $r$ we sum only the last $k$ recursive residuals $v_t$. The hypothesis of constancy is accepted iff

$|W_{r,k}| \le m(\alpha)$ for all $r = p+k, \ldots, n$

and

$q_l(\alpha) \le Q_{r,k} \le q_u(\alpha)$ for all $r = p+k, \ldots, n$,

respectively, where the critical values $m(\alpha)$, $q_l(\alpha)$ and $q_u(\alpha)$ are chosen in such a way that the resulting test is of level $\alpha$. The book of Hackl gives a fine review of various tests for constancy. He also introduces various modifications of these tests based on moving sums of recursive residuals. Using simulation experiments he compares the power of the various tests in rejecting a false hypothesis. He concludes that the MOSUM test based on $W_{r,k}$ is more powerful than the simple cumulative sum test based on $W_r$. On the other hand, his investigations indicate that the power of the quadratic cumulative sum test ($Q_r$) exceeds the power of his quadratic MOSUM test ($Q_{r,k}$). For more details we refer to the very interesting book of Hackl (1980).
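The moving sums can be illustrated by a short Python sketch. It assumes that the recursive residuals $v_{p+1}, \ldots, v_n$ and the scale estimate $\hat\sigma(n)$ are already available; the function names are hypothetical, and the normalisation of the quadratic statistic (window sum of squares divided by the variance estimate computed from the residuals outside the window) is an assumption about the exact form used by Hackl.

```python
def mosum(v, k, sigma_hat):
    """Moving sums W_{r,k} of the last k recursive residuals, r = p+k, ..., n.

    v holds v_{p+1}, ..., v_n; sigma_hat is the scale estimate (assumed given).
    """
    if not 0 < k < len(v):
        raise ValueError("k must satisfy 0 < k < n - p")
    window = sum(v[:k])
    out = [window / sigma_hat]
    for j in range(k, len(v)):
        window += v[j] - v[j - k]  # slide the window by one residual
        out.append(window / sigma_hat)
    return out

def mosum_sq(v, k):
    """Quadratic moving sums: window sum of squares over the variance
    estimate built from the n - p - k residuals outside the window."""
    m = len(v)  # m = n - p
    if not 0 < k < m:
        raise ValueError("k must satisfy 0 < k < n - p")
    total = sum(x * x for x in v)
    out = []
    for end in range(k, m + 1):
        inside = sum(x * x for x in v[end - k:end])
        var_outside = (total - inside) / (m - k)
        out.append(inside / var_outside)
    return out
```

Both functions return one value per admissible $r$; comparing them with critical values such as $m(\alpha)$ is left out because those values are not given here.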

In connection with the cumulative sum tests let us also refer to a paper by MacNeill (1978), who investigated asymptotic distributions of cumulative sums in a model with two polynomials as state functions and equidistant observations, which are based on the usual residuals $y_t - \hat\alpha' h(x_t)$.

1.2.3 Models with continuous state switching
1.2.3.1 The model 


Without loss of generality we now assume that in model (1) $z_t = x_t \in \mathbb{R}^1$ for the experimental design $\xi = (x_1, \ldots, x_n)$,

$x_1 \le x_2 \le \ldots \le x_n,$

and that additionally the conditions

$h_i(\gamma_i, \alpha_i) = h_{i+1}(\gamma_i, \alpha_{i+1}), \quad i = 1, \ldots, r-1,$ (18)

hold. Then the state functions and the change point $\gamma \in \Gamma^\xi$ define the regression function

$f(x) = g_\vartheta(x) = h_i(x, \alpha_i)$ for $\gamma_{i-1} \le x \le \gamma_i$, $i = 1, \ldots, r$.

In addition to (18), continuity conditions on the first $l_i$ derivatives of the regression function $g_\vartheta(x) = g(x, \vartheta)$ with respect to $x$ at the points $\gamma_i$, $i = 1, \ldots, r-1$, are sometimes also demanded. This class of models includes, for instance, models with continuously connected polynomials as state functions. These models play an important role in many applications. The reason for this is the relatively simple form of the functions and their good fitting properties, which are well known in spline theory (cf. Ahlberg, Nilson and Walsh, 1967). For this type of model the term 'spline regression' is used and many papers have dealt with it. Here we only refer to Poirier (1973), Wold (1974), Ertel and Fowlkes (1976), Buse and Lim (1977), Park (1978), Jupp (1978), Agarwal and Studden (1978) and Dathe and Müller (1980).

For each fixed $\gamma \in \Gamma^\xi$ let $\mathcal{N}(\gamma)$ denote the set of all $\alpha \in \mathcal{A}$ for which $g_\vartheta(x) = g_{(\alpha,\gamma)}(x)$ satisfies all continuity conditions. Then the parameter $\vartheta = (\alpha', \gamma')'$ varies in the space

$\Theta^\xi := \bigcup \{\mathcal{N}(\gamma) \times \{\gamma\} \mid \gamma \in \Gamma^\xi\}.$

Hence we have a model of the form

$y_t = g_{(\alpha, m(\gamma))}(x_t) + \varepsilon_t, \quad t = 1, \ldots, n, \quad E\varepsilon = 0, \quad D\varepsilon = \Sigma(m(\gamma))$
with $\alpha \in \mathcal{N}(\gamma)$, $\Sigma(m(\gamma)) \in \mathcal{V}_{m(\gamma)}$, $\gamma \in \Gamma^\xi$.

The notion of identifiability of $\alpha$ given $\gamma \in \Gamma^\xi$ corresponds to the identifiability of $\alpha$ given $m = m(\gamma)$ in Section 1.2.2.1. This means that

$g_{(\alpha_1,\gamma)}(x_t) = g_{(\alpha_2,\gamma)}(x_t), \quad t = 1, \ldots, n, \quad \alpha_1, \alpha_2 \in \mathcal{N}(\gamma)$

implies $\alpha_1 = \alpha_2$.

In contrast to the model with abrupt changes of state, not only may the change index $m(\gamma)$ be identifiable with respect to a suitable experimental design, but, because of the continuity conditions, also the change point $\gamma$ itself.

Let us return to Example 1.2.4 with two straight lines as state functions and a continuous change of state, i.e. $a_1 + b_1\gamma = a_2 + b_2\gamma$ holds. Under the assumption that $(a_1, b_1) \ne (a_2, b_2)$, the parameter $\alpha = (a_1, b_1, a_2, b_2)'$ is identifiable from the observations and $\gamma$ is uniquely determined by $\gamma = (a_1 - a_2)/(b_2 - b_1)$. Consequently it makes sense to estimate not only $m(\gamma)$, but $\gamma$ itself.
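The relation between the state parameters and the change point in the two-line example can be written as a small helper. This is only an illustrative sketch; the function name is hypothetical.

```python
def change_point(a1, b1, a2, b2):
    """Abscissa at which the lines a1 + b1*x and a2 + b2*x intersect.

    Solves a1 + b1*gamma = a2 + b2*gamma for gamma.
    """
    if b1 == b2:
        # equal slopes: the lines are parallel or identical, gamma is not unique
        raise ValueError("slopes are equal, no unique change point")
    return (a1 - a2) / (b2 - b1)
```

For the lines $1 + 2x$ and $7 - x$ the function returns the intersection abscissa, at which both lines take the same value.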

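Now we turn to models with linear state functions

$h_i(x, \alpha_i) = \alpha_i' h_i(x) \quad (\alpha_i \in \mathbb{R}^{p_i}), \quad i = 1, \ldots, r.$ (19)

The continuity conditions imposed on the regression function $f(x) = g_\vartheta(x)$ can be represented by

$\left.\dfrac{\partial^j}{\partial x^j}\,\alpha_i' h_i(x)\right|_{x=\gamma_i} = \left.\dfrac{\partial^j}{\partial x^j}\,\alpha_{i+1}' h_{i+1}(x)\right|_{x=\gamma_i}, \quad j = 0, 1, \ldots, l_i, \quad i = 1, \ldots, r-1.$ (20)

The equation (20) can equivalently be described with a matrix $A(\gamma)$ by $A(\gamma)\alpha = 0$, and $\mathcal{N}(\gamma) = \mathcal{N}(A(\gamma))$. The model has the representation

$y = H(m(\gamma))\,\alpha + \varepsilon, \quad E\varepsilon = 0, \quad D\varepsilon = \Sigma(m(\gamma)),$ (21)
$\alpha \in \mathcal{N}(A(\gamma)), \quad \Sigma(m(\gamma)) \in \mathcal{V}_{m(\gamma)}, \quad \gamma \in \Gamma^\xi.$

Given $\gamma \in \Gamma^\xi$ the parameter $\alpha$ is identifiable and hence $A(\gamma)$-conditionally unbiased linearly estimable iff

$r\begin{pmatrix} H(m(\gamma)) \\ A(\gamma) \end{pmatrix} = p.$

(Concerning the notion of $A(\gamma)$-conditional unbiasedness, see Bunke and Bunke, 1986.)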


1.2.3.2 Least squares estimators 


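As in models without continuity conditions (Section 1.2.2) a WLSE $\hat\vartheta$ can be computed as $\hat\vartheta = (\hat\alpha_{\hat\gamma}', \hat\gamma')'$ in two steps. For each fixed $\gamma$, $\hat\alpha_\gamma$ is a (restricted) weighted least squares estimator with respect to the parameter space $\mathcal{N}(\gamma)$. Then $\hat\gamma$ minimizes the residual sum of squares $S(\gamma) = n\,{}^{w}|y - g_{(\hat\alpha_\gamma,\gamma)}|_n^2$ defined by $\hat\alpha_\gamma$ with respect to $\gamma \in \Gamma^\xi$.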

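For models with nonlinear state functions only iterative methods will be applicable for the computation of the WLSE $\hat\alpha_\gamma$. In the case of linear state functions $\hat\alpha_\gamma$ is given by

$\hat\alpha_\gamma = C^{(1)}(\gamma)\, H'(m(\gamma))\, W y,$

where the matrix $C^{(1)}(\gamma)$ is defined by the relation

$\begin{pmatrix} H'(m(\gamma))\, W H(m(\gamma)) & A'(\gamma) \\ A(\gamma) & 0 \end{pmatrix}^{-} = \begin{pmatrix} C^{(1)}(\gamma) & C^{(2)}(\gamma) \\ C^{(3)}(\gamma) & C^{(4)}(\gamma) \end{pmatrix}$

and the matrix $H(m(\gamma))\, C^{(1)}(\gamma)\, H'(m(\gamma))\, W$ is a projection matrix onto the space $H(m(\gamma))\,\mathcal{N}(A(\gamma))$ with respect to the norm ${}^{W}\|z\| = (z'Wz)^{1/2}$ (cf. [A 1.8]). If

$r\begin{pmatrix} H(m(\gamma)) \\ A(\gamma) \end{pmatrix} = p$ is valid,

$\hat\alpha_\gamma = [H'(m(\gamma))\, W H(m(\gamma)) + A'(\gamma) A(\gamma)]^{-1} H'(m(\gamma))\, W y$

follows by [A 1.9].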


Remark 1.2.2 If $x_t \ne \gamma_i$ holds for all $t = 1, \ldots, n$ and $i = 1, \ldots, r-1$, then the mapping of the observations $x_t$ onto the state functions is uniquely determined. But if there exists a point $x_t$ with $x_t = \gamma_i$ for an $i \in \{1, \ldots, r-1\}$, then because of the continuity condition on the regression function at the point $\gamma_i$ we can allocate the observation at the point $x_t = \gamma_i$ to the state function $\alpha_i' h_i(x)$ as well as to the state function $\alpha_{i+1}' h_{i+1}(x)$. Thus there are two design matrices ${}_{i}H(\gamma)$ and ${}_{i+1}H(\gamma)$. If $\hat\alpha^{i}$ and $\hat\alpha^{i+1}$ denote the corresponding WLSE, we can show because of ${}_{i}H(\gamma)\,\mathcal{N}(A(\gamma)) = {}_{i+1}H(\gamma)\,\mathcal{N}(A(\gamma))$ that $\hat\alpha^{i}$ and $\hat\alpha^{i+1}$ provide the same residual sums of squares

${}^{w}|y - {}_{i}H(\gamma)\,\hat\alpha^{i}|_n^2 = {}^{w}|y - {}_{i+1}H(\gamma)\,\hat\alpha^{i+1}|_n^2.$


Now we turn to the problem of how to compute the WLSE $\hat\vartheta$. If $\Gamma^\xi = \{\gamma_0\}$, the WLSE is given by $(\hat\alpha_{\gamma_0}', \gamma_0')'$. A model with a known change point $\gamma_0$ is treated by Poirier (1973) and Buse and Lim (1977), who consider a model with continuously connected polynomials of third degree and moreover demand the continuity of the first and second derivatives with respect to $x$ of the regression function. But a model with a known change point $\gamma_0$ is an exception. Generally most of the computational effort arises as a result of the variation of $\gamma$ in $\Gamma^\xi$.

For a finite set $\Gamma^\xi = \{\gamma^{(1)}, \ldots, \gamma^{(\nu)}\}$ we can compute $\hat\gamma$ and hence a WLSE $\hat\vartheta$ immediately, although with great computational effort if $\nu$ is large. Often the parameter space is a set with infinitely many elements. A general algorithm to determine $\hat\gamma$ for an infinite $\Gamma^\xi$ is not known.

Hudson (1966) investigated the determination of an OLSE $\hat\vartheta$ for a compact parameter space $\Gamma^\xi$ in models with continuously connected linear state functions without further continuity conditions on the derivatives. He obtained some important analytical properties, which allow a reduction of the computational effort for determining an OLSE $\hat\vartheta$. For the special case of a model with two continuously connected straight lines we get a very simple practicable algorithm for determining an OLSE $\hat\vartheta$ (Hudson, 1966; Hinkley, 1969, 1971). But Hudson's method does not lead to an applicable algorithm in every model.

In models where the derivatives of the state functions may possibly also be continuously connected at the state points, we have to resort, taking Hudson's results into consideration, in certain regions to an approximate determination of $\hat\gamma$ which is based on a given finite net $\Gamma_1^\xi \subset \Gamma^\xi$ of $\gamma$-values. In general, an approximate WLSE $\hat\vartheta_1$ will be used, where the continuous space $\Gamma^\xi$ is approximated by a discrete finite set $\Gamma_1^\xi$ and where $\hat\vartheta_1$ is a solution of

${}^{w}|y - g_{\hat\vartheta_1}|_n^2 = \min_{\gamma \in \Gamma_1^\xi}\ \min_{\alpha \in \mathcal{N}(\gamma)} {}^{w}|y - g_{(\alpha,\gamma)}|_n^2.$
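The two-step computation over a finite net of change points can be sketched in Python for the special case of two continuously connected straight lines. The sketch is not Hudson's algorithm; it simply evaluates the restricted weighted least squares fit for every $\gamma$ of a user-supplied grid. The function names, the pure-Python solver and the way continuity is built in (writing the regression function as $c_0 + c_1\min(x-\gamma,0) + c_2\max(x-\gamma,0)$) are assumptions made for illustration.

```python
def solve_linear(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    A = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[piv][col]) < 1e-12:
            raise ValueError("singular normal equations")
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = A[i][n] - sum(A[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / A[i][i]
    return z

def fit_fixed_gamma(x, y, w, gamma):
    """Restricted WLS fit of two continuously joined lines for a fixed gamma.

    Returns (alpha, S) with alpha = (a1, b1, a2, b2) and the weighted
    residual sum of squares S.
    """
    rows = [(1.0, min(xi - gamma, 0.0), max(xi - gamma, 0.0)) for xi in x]
    M = [[sum(wi * r[i] * r[j] for wi, r in zip(w, rows)) for j in range(3)]
         for i in range(3)]
    rhs = [sum(wi * r[i] * yi for wi, r, yi in zip(w, rows, y)) for i in range(3)]
    c0, c1, c2 = solve_linear(M, rhs)
    S = sum(wi * (yi - (c0 * r[0] + c1 * r[1] + c2 * r[2])) ** 2
            for wi, r, yi in zip(w, rows, y))
    # map back to intercepts and slopes of the two lines
    alpha = (c0 - c1 * gamma, c1, c0 - c2 * gamma, c2)
    return alpha, S

def grid_search(x, y, w, grid):
    """Minimize S(gamma) over a finite grid; returns (alpha, gamma, S)."""
    best = None
    for gamma in grid:
        alpha, S = fit_fixed_gamma(x, y, w, gamma)
        if best is None or S < best[2]:
            best = (alpha, gamma, S)
    return best
```

An observation lying exactly at $x_t = \gamma$ contributes zero to both slope regressors, so its allocation to either state gives the same fit, in line with Remark 1.2.2. The cost grows linearly with the number of grid points, which reflects the computational effort mentioned above.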


Another possible way to determine a WLSE $\hat\vartheta$ consists in using a reparametrization $\vartheta = \tau(\beta)$ based on the continuity conditions and determining the WLSE $\hat\beta$ of $\beta$ in the reparametrized model. To determine the WLSE $\hat\beta$ we will use iterative methods. Afterwards, $\hat\vartheta$ will be computed as $\hat\vartheta = \tau(\hat\beta)$. The method was used by Gallant and Fuller (1973) to compute an OLSE $\hat\vartheta$ in a model with $r$ piecewise continuously connected polynomials with continuous first and second derivatives. Furthermore, they gave sufficient conditions for the convergence of the modified Gauss-Newton iteration algorithm they used. Iterative techniques for determining an OLSE in models with continuously connected polynomials are also investigated by Jupp (1978).*
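For the model of Example 1.2.4 with two continuously connected straight lines, one possible reparametrization can be written down explicitly. It is given here only as an illustration and is not necessarily the parametrization used by Gallant and Fuller; the symbols $a_1, b_1, a_2, b_2$ are those of the example. The continuity condition $a_1 + b_1\gamma = a_2 + b_2\gamma$ is used to eliminate $a_2$:

```latex
\beta = (a_1, b_1, b_2, \gamma)', \qquad
\vartheta = \tau(\beta) = \bigl(a_1,\; b_1,\; a_1 + (b_1 - b_2)\gamma,\; b_2,\; \gamma\bigr)',
\qquad
g_{\tau(\beta)}(x) = a_1 + b_1 x + (b_2 - b_1)\max(x - \gamma,\, 0).
```

For $x \le \gamma$ this gives $a_1 + b_1 x$, and for $x > \gamma$ it gives $a_1 + (b_1 - b_2)\gamma + b_2 x = a_2 + b_2 x$, so every $\beta$ corresponds to a parameter $\vartheta$ satisfying the continuity condition. The resulting regression function is nonlinear in $\gamma$, which is why iterative methods are needed.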

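As in the model with abrupt changes of state we can introduce minimal state lengths $\delta_i$, $i = 1, \ldots, r$, and the corresponding parameter space $\Theta_\delta^\xi = \{\vartheta = (\alpha, \gamma) \mid |\gamma_i - \gamma_{i-1}| \ge \delta_i,\ i = 1, \ldots, r,\ \alpha \in \mathcal{N}(\gamma),\ \gamma \in \Gamma^\xi\}$. A WLSE $\hat\vartheta^\delta$ with respect to $\Theta_\delta^\xi$ has similar properties to a WLSE $\hat\vartheta$ ($= \hat\vartheta^0$) with respect to $\Theta^\xi$.

Some asymptotic properties of $\hat\vartheta^\delta$ and $\hat\vartheta$, especially some sufficient conditions for their consistency, will be investigated in Section 1.2.4.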


1.2.3.3 Some testing problems 


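As in Section 1.2.2.3 we assume $\varepsilon_t$ to be normally distributed with a constant variance $\sigma^2$. We confine our attention to models with $r = 2$ linear state functions $\alpha_i' h(x_t)$, $i = 1, 2$, which differ only in the parameter.


(a) Known change point


We consider the model

$y = H(m(\gamma))\,\alpha + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n), \quad \alpha \in \mathcal{N}(A(\gamma)), \quad \sigma^2 \in \mathbb{R}^+.$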


* An approximate method for calculating least squares estimates of the model parameters in models with changes was proposed by Bunke and Schulze (1984). This method is based on a differentiable approximation of the segmented regression function and allows the use of the well-known iterative techniques for calculating least squares estimates in nonlinear regression.

As in Section 1.2.2 we investigate the test problem

$H: \alpha_1 = \alpha_2$ against $K: \alpha_1 \ne \alpha_2$.

Under the testability condition $r\begin{pmatrix} H(m(\gamma)) \\ A(\gamma) \end{pmatrix} = p = 2p_1$, the linear hypothesis testing provides the test statistic

$F(\gamma) = \dfrac{f_2}{f_1}\, \dfrac{\|P_{\mathcal{K}}\, y - P_{\mathcal{H}}\, y\|^2}{\|y - P_{\mathcal{K}}\, y\|^2} = \dfrac{f_2}{f_1}\, \dfrac{\|y - P_{\mathcal{H}}\, y\|^2 - \|y - P_{\mathcal{K}}\, y\|^2}{\|y - P_{\mathcal{K}}\, y\|^2} \sim F_{f_1, f_2}$ under $H$,

where $P_{\mathcal{K}}$ and $P_{\mathcal{H}}$ denote projection matrices onto the spaces

$\mathcal{K} = H(m(\gamma))\,\mathcal{N}(A(\gamma))$ and $\mathcal{H} = \mathcal{R}\begin{pmatrix} H_1(m(\gamma)) \\ H_2(m(\gamma)) \end{pmatrix}$,

and $f_1$ and $f_2$ are defined by

$f_1 = 2p_1 - r[A(\gamma)] - r\begin{pmatrix} H_1(m(\gamma)) \\ H_2(m(\gamma)) \end{pmatrix}$ and $f_2 = n - 2p_1 + r[A(\gamma)]$.

(b) Known change index m = m(y) 


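Now we suppose that the change point $\gamma$ is unknown, but that we have additional information about $\gamma$ through the change index $m = m(\gamma)$. We define

$\Gamma_{(m)}^\xi = \{\gamma \in \Gamma^\xi \mid m = m(\gamma)\}$

and obtain the model

$y = H(m)\,\alpha + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n),$ (22)
$\alpha \in \mathcal{N}(A(\gamma)), \quad \gamma \in \Gamma_{(m)}^\xi, \quad \sigma^2 \in \mathbb{R}^+.$

Since $\gamma$ is unknown, Sprent (1961) proposes not to test the problem $H: \alpha_1 = \alpha_2$ against $K: \alpha_1 \ne \alpha_2$ in the model (22), but in the larger model

$y = H(m)\,\alpha + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n),$ (23)
$\alpha \in \mathbb{R}^p, \quad \sigma^2 \in \mathbb{R}^+.$

Thus we do not use the continuity conditions and we can apply the test (11) from Section 1.2.2.3.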


We are also interested in whether in the model (22) the change takes place at a given point $\gamma_0 \in \Gamma_{(m)}^\xi$. The corresponding test problem

$H: \gamma = \gamma_0$ against $K: \gamma \ne \gamma_0$

is not testable in the model (22). Instead of the hypothesis $H: \gamma = \gamma_0$ in the model (22), Sprent (1961) considers the linear hypothesis $H: \alpha \in \mathcal{N}(A(\gamma_0))$ in the larger model (23). Under the testability assumption $(A(\gamma_0))' \in \mathcal{R}((H(m))')$, this procedure leads to the test statistic

$\dfrac{n - r[H(m)]}{r[A(\gamma_0)]}\; \dfrac{\|y - P_{H(m)\mathcal{N}(A(\gamma_0))}\, y\|^2 - \|y - P_{H(m)}\, y\|^2}{\|y - P_{H(m)}\, y\|^2} \sim F_{r[A(\gamma_0)],\; n - r[H(m)]}$ under $H$.

Furthermore, Sprent (1961) also examines the corresponding test problems that occur when several regression functions with changes of state are observed simultaneously.


(c) No additional information about y 


In the case of an unknown change point $\gamma \in \Gamma^\xi$ the literature usually deals with tests in the model with two continuously connected straight lines and identically $N(0, \sigma^2)$-distributed observation errors.

In this special model Hinkley (1969, 1971) investigates the likelihood ratio tests for testing

$H_0: \alpha_1 = \alpha_2$ against $K_0: \alpha_1 \ne \alpha_2$

and

$H_1: \gamma = \gamma_0$ against $K_1: \gamma \ne \gamma_0$.

Under the assumption $r[H(m(\gamma))] = 4$ for all $\gamma \in \Gamma^\xi$ the problem $H_0$ against $K_0$ is testable. The problem $H_1$ against $K_1$ is testable only under further assumptions. Under the assumptions

$r[H(m(\gamma))] = 4$ for all $\gamma \in \Gamma^\xi$

and

$\mathcal{N}(A(\gamma)) \cap \mathcal{N}(A(\gamma_0)) = \{\alpha = (\alpha_1', \alpha_2')' \mid \alpha_1 = \alpha_2\}$ for all $\gamma \in \Gamma^\xi$ with $\gamma \ne \gamma_0$

$H_1$ against $K_1$ is testable in the model that is restricted by the condition $\alpha_1 \ne \alpha_2$. The use of the restricted model only means that the case of no change of state is excluded. But the restriction is of no importance in the computation of the likelihood ratio statistic since $(I_2 \,\vdots\, -I_2)\,\hat\alpha_\gamma \ne 0$ holds with probability 1 for all $\gamma \in \Gamma^\xi$. The likelihood ratio tests for testing $H_0$ and $H_1$, respectively, lead to the statistics


$\Lambda_0 = \dfrac{1}{\sigma^2}\,(S - S(\hat\gamma))$

and

$\Lambda_1 = \dfrac{1}{\sigma^2}\,(S(\gamma_0) - S(\hat\gamma)),$

respectively, where $S(\gamma_0) = S(\hat\alpha_{\gamma_0}, \gamma_0)$, $S(\hat\gamma) = S(\hat\alpha_{\hat\gamma}, \hat\gamma) = S(\hat\vartheta)$ and $S$ denote the residual sums of squares of the OLSE $\hat\alpha_{\gamma_0}$, $\hat\alpha_{\hat\gamma}$ and the OLSE $\hat\alpha$ in the model without changes of state. Hinkley (1969) pointed out that $\Lambda_1$ is asymptotically $\chi_1^2$-distributed under $H_1$ and he conjectured that $\Lambda_0$ is asymptotically $\chi_3^2$-distributed under $H_0$. He proposed replacing the unknown variance $\sigma^2$ in $\Lambda_0$ and $\Lambda_1$ by a consistent $\chi^2$-distributed estimator, and approximating the distributions of the obtained test statistics by corresponding $F$-distributions.

Another asymptotic test for testing $H_0$ using a partial Bayes approach was given by Farley and Hinich (1970). As a restriction, they supposed that the change of state takes place at exactly one of the points $x_1, \ldots, x_n$ and that each point may be the change point with the same probability $1/n$.

A general test problem in a more general model was considered by Feder (1975b). In the model with $r$ continuously connected linear state functions and independently and identically but arbitrarily distributed errors $\varepsilon_t$ he considered tests to check

$H: \vartheta \in \Theta_H$ against $K: \vartheta \in \Theta_K := \Theta \setminus \Theta_H,$

where $\Theta_H$ is a subspace of $\Theta$ with certain properties. As test statistic he used

$\lambda = \left(\dfrac{S(\hat\vartheta_K)}{S(\hat\vartheta_H)}\right)^{n/2},$

where $S(\hat\vartheta_H)$ and $S(\hat\vartheta_K)$ denote the residual sums of squares of an OLSE of $\vartheta$ under $H$ and $K$, respectively. He gave sufficient conditions for the asymptotic $\chi^2$-distribution of $-2\ln\lambda$. For a model with two intersecting straight lines with independently and identically distributed observation errors, it follows in particular that the corresponding statistics $-2\ln\lambda_1$ and $-2\ln\lambda_2$ for testing

$H_1: \gamma = \gamma_0$ against $K_1: \gamma \ne \gamma_0$

and

$H_2: \vartheta = (\alpha_1', \alpha_2', \gamma)' = (\alpha_{10}', \alpha_{20}', \gamma_0)' = \vartheta_0$ against $K_2: \vartheta \ne \vartheta_0$

asymptotically have a $\chi_1^2$- or $\chi_4^2$-distribution under the corresponding hypotheses. Asymptotically the commonly used statistics

$-2\ln\lambda_1 = n\,(\ln S(\gamma_0) - \ln S(\hat\gamma))$

and

$-2\ln\lambda_2 = n\,(\ln S(\vartheta_0) - \ln S(\hat\vartheta))$

have the same distributions.
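The statistic for testing a given change point can be evaluated directly from the two residual sums of squares. The sketch below assumes that $S(\gamma_0)$ and $S(\hat\gamma)$ have already been computed and that the asymptotic $\chi^2$-distribution with one degree of freedom is used as reference distribution; the function names are hypothetical. The tail probability of $\chi_1^2$ is expressed through the complementary error function, since $P(\chi_1^2 > x) = \operatorname{erfc}(\sqrt{x/2})$.

```python
import math

def lr_statistic(S_hyp, S_alt, n):
    """-2 ln(lambda) = n * (ln S_hyp - ln S_alt) for residual sums of squares."""
    if S_hyp <= 0 or S_alt <= 0:
        raise ValueError("residual sums of squares must be positive")
    return n * (math.log(S_hyp) - math.log(S_alt))

def chi2_1_pvalue(x):
    """Upper tail probability of the chi-square distribution with 1 d.f."""
    if x < 0:
        raise ValueError("statistic must be non-negative")
    return math.erfc(math.sqrt(x / 2.0))
```

Since the approximation is asymptotic, the resulting p-value should be read with caution for small $n$.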

By an example Feder demonstrated that the asymptotic distribution of $-2\ln\lambda$ may deviate from a central $\chi^2$-distribution if $\vartheta$ is not identifiable. In the model with two intersecting straight lines the change point $\gamma$ and consequently the parameter $\vartheta$ are not identifiable under $H_0: \alpha_1 = \alpha_2$, so that the asymptotic distribution of $-2\ln\lambda_0 = n\,(\ln S - \ln S(\hat\vartheta))$ under $H_0$ is not known up to now.

1.2.4 Some asymptotic results on least squares estimators
in ordered models with state switching


In this section we want to derive sufficient conditions for the consistency of the weighted least squares estimators in the models with abrupt as well as with continuous changes of state. The results and the technique of proof are closely related to the results in Section 1.1.5 on the consistency of a WLSE in general nonlinear models. We will show that some essential assumptions from Section 1.1.5 are not fulfilled for the models with changes of state considered here, so that the following statements about consistency cannot be immediately derived from Theorem 1.1.1.

Analogously to Section 1.1.5 we will split the state parameters $\alpha_i$, $i = 1, \ldots, r$, into their linear and nonlinear parts, where we will denote the linear parts by $\alpha_i$ and the nonlinear ones by $\eta_i$ ($\alpha_i \to (\alpha_i', \eta_i')'$), i.e. for the state function we have $h_i(x, \alpha_i, \eta_i) = \alpha_i' h_i(x, \eta_i)$, $i = 1, \ldots, r$. In this connection the following notation is introduced:

$\alpha_i \in \mathbb{R}^{p_i}, \quad \eta_i \in H_i \subset \mathbb{R}^{q_i}, \quad i = 1, \ldots, r,$
$\alpha = (\alpha_1', \ldots, \alpha_r')' \in \mathbb{R}^p, \quad \eta = (\eta_1', \ldots, \eta_r')' \in H = \times_{i=1}^{r} H_i,$
$\beta = (\eta', \gamma')' \in H \times \Gamma =: \mathcal{B},$
$\vartheta = (\alpha', \beta')' \in \mathbb{R}^p \times \mathcal{B} =: \Theta,$
$\Theta_n := \{\vartheta \in \Theta \mid \gamma \in \Gamma_n\}, \quad \Theta_n^\delta := \{\vartheta \in \Theta \mid \gamma \in \Gamma_n^\delta\},$
$\Theta^\delta := \{\vartheta \in \Theta \mid \gamma \in \Gamma^\delta\},$


where the index $n$ indicates here and in the following the dependence on the sample size $n$. Then the function $g_\vartheta(x, z)$ has the representation

$g_\vartheta(x, z) = \alpha' h_\beta(x, z)$ with
$h_\beta(x, z) = h(x, z, \eta, \gamma) = \left(h_1'(x, \eta_1)\, 1_{[a,\gamma_1]}(z), \ldots, h_r'(x, \eta_r)\, 1_{(\gamma_{r-1}, b]}(z)\right)'.$

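As in Section 1.1.5, we allow the inadequacy of the model, i.e. the unknown regression function $f(x, z)$ is to be approximated by a piecewise function $g_\vartheta(x, z)$. The WLSE $\hat\vartheta_n^\delta$ is defined to be a solution of

$Q_n(\hat\vartheta_n^\delta) = \min_{\vartheta \in \Theta_n^\delta} Q_n(\vartheta)$

with

$Q_n(\vartheta) = {}^{w}|y - g_\vartheta|_n^2 = \dfrac{1}{n}\sum_{t=1}^{n} w_t^{(n)}\,(y_t - g_\vartheta(x_t, z_t))^2,$

where $\hat\vartheta_n = \hat\vartheta_n^0$ holds (i.e. $\delta = 0$) (cf. Section 1.2.2.2).

In contrast to the assumptions of Theorem 1.1.1 in Section 1.1.5:

• The parameter space $\Theta_n^\delta$ is not constant, but varies with the experimental design $\xi_n$; $\Theta_n^\delta \subset \Theta^\delta \subset \Theta$.
• If $r > 2$, then the parameter space

$\Gamma = \{\gamma = (\gamma_1, \ldots, \gamma_{r-1})' \mid c \le \gamma_1 < \gamma_2 < \ldots < \gamma_{r-1} \le d\}$

is not compact and thus the space $\mathcal{B} = H \times \Gamma$ is not compact; it holds that $\Gamma^\delta \subset \Gamma \subset \Gamma_0$ and $\mathcal{B}^\delta := H \times \Gamma^\delta \subset \mathcal{B} \subset \mathcal{B}_0 := H \times \Gamma_0$, and $\Gamma^\delta$ ($\delta > 0$) and $\Gamma_0$ are compact.

• The central assumption of the continuity of $h_\beta(x, z)$ in $\beta$ for fixed $x$ and $z$ (Ag) is violated in general, as the following example will show.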


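We consider a model with two linear state functions and $z = x$. Then $g_\vartheta(x) = \alpha' h_\gamma(x)$ holds with

$h_\gamma(x) = \begin{pmatrix} h_1(x)\, 1_{[a,\gamma]}(x) \\ h_2(x)\, 1_{(\gamma,b]}(x) \end{pmatrix}.$

If $h_i(x) \ne 0$, $i = 1, 2$, then $h_\gamma(x)$ for any fixed $x$ is discontinuous in $\gamma$ at the point $\gamma = x$.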
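The discontinuity in $\gamma$ can be made visible with a small numerical sketch. It assumes straight lines as state functions, i.e. $h_1(x) = h_2(x) = (1, x)'$, and an interval $[a, b] = [0, 10]$; both choices and the function name are illustrative only.

```python
def h_gamma(x, gamma, a=0.0, b=10.0):
    """Regressor vector for two linear states with h_1(x) = h_2(x) = (1, x)'.

    The first block is active on [a, gamma], the second on (gamma, b].
    """
    left = 1.0 if a <= x <= gamma else 0.0
    right = 1.0 if gamma < x <= b else 0.0
    return (left, x * left, right, x * right)
```

For fixed $x = 4$ the vector equals $(1, 4, 0, 0)$ at $\gamma = 4$ but jumps to $(0, 0, 1, 4)$ for any $\gamma$ slightly smaller than 4, however close to 4 it is chosen.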

Although essential assumptions from Theorem 1.1.1 in Section 1.1.5 are thus violated, the consistency of the WLSE can be proved under similar conditions and with a similar technique as in Theorem 1.1.1. First we formulate the following assumptions:

1. Let a function $u: \mathcal{Z} \to \mathbb{R}^1$ exist that has at most a finite number of points of discontinuity, with $0 < \kappa_1 \le u(z) \le \kappa_2 < \infty$ for all $z \in \mathcal{Z}$ and
$S_n = \max_{1 \le t \le n} |w_t^{(n)} - u(z_{tn})| \to 0$ a.s.

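2. $\varepsilon_t$ are independent random variables with $E\varepsilon_t = 0$,

$E\varepsilon_t^2 = \sigma_t^2 = \sum_{s=1}^{r} \sigma_s^2\, 1_{(\gamma_{s-1}^0, \gamma_s^0]}(z_{tn}),$

(a) $\varepsilon_t$ are identically distributed with $\sigma_t^2 = \sigma^2$ or

(b) the 4th moments satisfy the condition $\sum_{t=1}^{\infty} t^{-2} E\varepsilon_t^4 < \infty$.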


3. The sequences of empirical distribution functions

$F_n(x, z) = \dfrac{1}{n}\sum_{t=1}^{n} 1_{(-\infty, x] \times (-\infty, z]}(x_{tn}, z_{tn}),$

$F_n(z) = \dfrac{1}{n}\sum_{t=1}^{n} 1_{(-\infty, z]}(z_{tn})$

defined by the sequences of experimental designs $\xi_n = ((x_{1n}, z_{1n}), \ldots, (x_{nn}, z_{nn}))$ converge for all $(x, z) \in \mathcal{X} \times \mathcal{Z}$ to distribution functions $F(x, z)$ and $F(z)$ over $\mathcal{X} \times \mathcal{Z}$ or $\mathcal{Z}$, and $F(z)$ is continuous and strictly increasing on $[c, d]$.

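4. The regression function $f(x, z)$ is piecewise continuous and bounded, i.e. there exists a finite number of points $\tau_0 := a < \tau_1 < \ldots < \tau_\nu := b$, so that $f(x, z)$ is continuous and bounded on $\mathcal{X} \times (\tau_{i-1}, \tau_i]$ for all $i = 1, \ldots, \nu$.

5. Let $g_\vartheta(x, z) = \alpha' h_\beta(x, z)$ hold with

$h_\beta(x, z) = h(x, z, \eta, \gamma) = \left(h_1'(x, \eta_1)\, 1_{[a,\gamma_1]}(z),\ h_2'(x, \eta_2)\, 1_{(\gamma_1,\gamma_2]}(z),\ \ldots,\ h_r'(x, \eta_r)\, 1_{(\gamma_{r-1},b]}(z)\right)',$

and let all components $h_{ij}(x, \eta_i)$, $j = 1, \ldots, p_i$, of $h_i(x, \eta_i)$ be continuous on the compact sets $\mathcal{X} \times H_i$, $i = 1, \ldots, r$. For all $\beta = (\eta, \gamma) \in H \times \Gamma$, let

${}^{u}(h_\beta, h_\beta') = \left(\int_{\mathcal{X} \times \mathcal{Z}} u(z)\, h_{\beta(i)}(x, z)\, h_{\beta(j)}(x, z)\, dF(x, z)\right)_{i,j = 1, \ldots, p}$

be a nonsingular matrix.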
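5'. Let $x = z$ and $g_\vartheta(x) = \alpha' h_\beta(x)$ hold with

$h_\beta(x) = h(x, \eta, \gamma) = \left(h_1'(x, \eta_1)\, 1_{[a,\gamma_1]}(x),\ \ldots,\ h_r'(x, \eta_r)\, 1_{(\gamma_{r-1},b]}(x)\right)'$

and let all components $h_{ij}(x, \eta_i)$ of $h_i(x, \eta_i)$ be continuous on the compact sets $[a, b] \times H_i$, $i = 1, \ldots, r$. The continuity conditions on the function $g_\vartheta(x)$ can be described by a matrix $A(\beta)$ by $A(\beta)\,\alpha = 0$ with the property that the matrix $A(\beta)$ is continuous in $\beta \in H \times \Gamma$ and the matrix

${}^{u}(h_\beta, h_\beta') + A'(\beta) A(\beta) = \left(\int_a^b u(x)\, h_{\beta(i)}(x)\, h_{\beta(j)}(x)\, dF(x)\right)_{i,j = 1, \ldots, p} + A'(\beta) A(\beta)$

is nonsingular for all $\beta = (\eta, \gamma) \in H \times \Gamma$.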
6. (a) Let a unique solution $\vartheta^\delta$ of

$\min_{\vartheta \in \Theta^\delta} {}^{u}|f - g_\vartheta|^2 = {}^{u}|f - g_{\vartheta^\delta}|^2$

exist.
(b) Let a unique solution $\vartheta^*$ of

$\min_{\vartheta \in \Theta} {}^{u}|f - g_\vartheta|^2 = {}^{u}|f - g_{\vartheta^*}|^2$

exist.
7. There is a $\vartheta^0 \in \Theta$ with $f = g_{\vartheta^0}$.
8. For all $\vartheta, \bar\vartheta \in \Theta$, ${}^{u}|g_\vartheta - g_{\bar\vartheta}|^2 = 0$ iff $\vartheta = \bar\vartheta$.

Here ${}^{w}(l, k)_n$ and ${}^{u}(l, k)_n$ are defined as in Section 1.1.5, and ${}^{u}(l, k) := \int_{\mathcal{X} \times \mathcal{Z}} u(z)\, l(x, z)\, k(x, z)\, dF(x, z)$. Generally we use the notation introduced in Section 1.1.5. The assumption 5' is related to the model with continuous changes of state and is an analogue of the assumption 5, which concerns models with abrupt changes of state.

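The assumption 1 allows weights $w_t$ that may depend on the observations and the measuring points $z_t$. The assumption is trivially fulfilled for the OLSE and the GLSE.

From 2(b) it follows in particular that

$\min_{1 \le i \le r} \sigma_i^2 \le \sigma_t^2 \le \max_{1 \le i \le r} \sigma_i^2 < \infty$ for all $t = 1, 2, \ldots$

and under the assumptions 1-3 with the theorem of Helly-Bray it follows that

$\dfrac{1}{n}\sum_{t=1}^{n} u(z_{tn})\,\sigma_t^2 = \sum_{i=1}^{r} \sigma_i^2\, \dfrac{1}{n}\sum_{t=m_n(\gamma_{i-1}^0)+1}^{m_n(\gamma_i^0)} u(z_{tn}) = \sum_{i=1}^{r} \sigma_i^2 \int_{\gamma_{i-1}^0}^{\gamma_i^0} u(z)\, dF_n(z) \longrightarrow \sum_{i=1}^{r} \sigma_i^2 \int_{\gamma_{i-1}^0}^{\gamma_i^0} u(z)\, dF(z) =: \tau_u(\gamma^0),$

so that the assumption A (a) or (b) from Theorem 1.1.1 is satisfied, where we give up the modified Lindeberg condition, which is not necessary for the consistency.

For all $\vartheta = (\alpha', \eta', \gamma')' \in \Theta$ let $\delta_i(\vartheta) = \gamma_i - \gamma_{i-1}$, $i = 1, \ldots, r$, and $\delta(\vartheta) = (\delta_1(\vartheta), \ldots, \delta_r(\vartheta))'$ be the state lengths given by $\gamma$. For $\delta_1 = (\delta_{11}, \ldots, \delta_{1r})'$, $\delta_2 = (\delta_{21}, \ldots, \delta_{2r})'$ let $\delta_1 \le \delta_2$ or $\delta_1 < \delta_2$ be defined componentwise by $\delta_{1i} \le \delta_{2i}$, $i = 1, \ldots, r$, or $\delta_{1i} < \delta_{2i}$, $i = 1, \ldots, r$.

The next theorem provides the consistency for the WLSE $\hat\vartheta_n^\delta$ in the case of abrupt change of state.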

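Theorem 1.2.1
1. Under assumptions 1, 2 (a) or (b), 3-5, we have

(a) $\lim_{n\to\infty} {}^{w}|f - g_{\hat\vartheta_n^\delta}|_n = \lim_{n\to\infty} {}^{u}|f - g_{\hat\vartheta_n^\delta}| = \Delta_\delta$ a.s.
with $\Delta_\delta := \min_{\vartheta \in \Theta^\delta} {}^{u}|f - g_\vartheta|$,

(b) if there exists a $\vartheta^* \in \Theta$ with
${}^{u}|f - g_{\vartheta^*}| = \min_{\vartheta \in \Theta} {}^{u}|f - g_\vartheta| =: \Delta_0$

it follows that

$\lim_{n\to\infty} {}^{w}|f - g_{\hat\vartheta_n^\delta}|_n = \lim_{n\to\infty} {}^{u}|f - g_{\hat\vartheta_n^\delta}| = \Delta_0$ a.s.

for all $\delta$ with $0 < \delta \le \delta^* := \delta(\vartheta^*)$ (consistency of the WLSA).

2. Under 1, 2 (a) or (b), 3-5, we have

(a) under 6 (a), $\hat\vartheta_n^\delta \to \vartheta^\delta$ a.s.,
(b) under 6 (b), $\hat\vartheta_n^\delta \to \vartheta^*$ a.s. for all $\delta$ with $0 < \delta \le \delta^* := \delta(\vartheta^*)$ (consistency of the WLSE).

3. Under 1, 2 (a) or (b), 3-5 and 7, we have

$\Delta_\delta = \Delta_0 = 0$ and

$Q_n(\hat\vartheta_n^\delta) \to \tau_u(\gamma^0) = \sum_{i=1}^{r} \sigma_i^2 \int_{\gamma_{i-1}^0}^{\gamma_i^0} u(z)\, dF(z)$ a.s.

for all $\delta$ with $0 < \delta \le \delta_0 := \delta(\vartheta^0)$ (consistency of the variance estimate).

4. Under 1, 2 (a) or (b), 3-5, 7 and 8, we have

$\hat\vartheta_n^\delta \to \vartheta^0$ a.s. for all $\delta$ with $0 < \delta \le \delta_0 := \delta(\vartheta^0)$

(consistency of the WLSE).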
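Proof. The proof essentially follows the ideas of the proof of Theorem 1.1.1. Therefore we can restrict ourselves here to a short sketch of the differences. Let the sets $\mathcal{B}_0$ and $\mathcal{H}$ be defined by

$\mathcal{B}_0 := H \times \Gamma_0$ and $\mathcal{H} := \{h_{\beta(i)} \mid i = 1, \ldots, p,\ \beta \in \mathcal{B}_0\}.$

Since

${}^{u}(l, k)_n = \dfrac{1}{n}\sum_{t=1}^{n} u(z_{tn})\, l(x_{tn}, z_{tn})\, k(x_{tn}, z_{tn}) = \int_{\mathcal{X} \times \mathcal{Z}} u(z)\, l(x, z)\, k(x, z)\, dF_n(x, z)$

we can show by means of an appropriate splitting of the integrals, the theorem by Polya, and by using the compactness of $\mathcal{B}_0$ that

$\sup_{l,k \in \mathcal{H}} |{}^{u}(l, k)_n - {}^{u}(l, k)| \to 0,$

which according to Lemma 1.1.1 yields

$\sup_{l,k \in \mathcal{H}} |{}^{w}(l, k)_n - {}^{u}(l, k)| \to 0.$ (24)

Now (24) implies

$\sup_{\beta, \bar\beta \in \mathcal{B}_0} \left|\, {}^{w}|h_{\beta(i)} - h_{\bar\beta(i)}|_n - {}^{u}|h_{\beta(i)} - h_{\bar\beta(i)}| \,\right| \to 0, \quad i = 1, \ldots, p,$ (25)

and we can show further that

${}^{u}|h_{\beta(i)} - h_{\bar\beta(i)}| \to 0$ for $\beta \to \bar\beta\ (\in \mathcal{B}_0)$, $i = 1, \ldots, p$. (26)

The statement of Lemma 1.1.2,

$\sup_{l \in \mathcal{H}} {}^{w}(l, \varepsilon)_n \to 0,$ (27)

remains true, but because of the missing continuity property of $h_\beta(x, z)$ (Ag) it must be proved in another way. As in the proof of Lemma 1.1.2 we show:

${}^{w}(l, \varepsilon)_n \to 0$ for all $l \in \mathcal{H}$ (28)

and

${}^{w}|\varepsilon|_n^2 \to \tau_u(\gamma^0).$ (29)

Exploiting

$|{}^{w}(h_{\beta(i)}, \varepsilon)_n| \le {}^{w}|h_{\beta(i)} - h_{\bar\beta(i)}|_n\; {}^{w}|\varepsilon|_n + |{}^{w}(h_{\bar\beta(i)}, \varepsilon)_n|$
$\le \left|\, {}^{w}|h_{\beta(i)} - h_{\bar\beta(i)}|_n - {}^{u}|h_{\beta(i)} - h_{\bar\beta(i)}| \,\right|\, {}^{w}|\varepsilon|_n + {}^{u}|h_{\beta(i)} - h_{\bar\beta(i)}|\; {}^{w}|\varepsilon|_n + |{}^{w}(h_{\bar\beta(i)}, \varepsilon)_n|,$

(25), (26), (28), (29) and the compactness of $\mathcal{B}_0$, it follows similarly as in the proof of Lemma 1.1.2 that

$\sup_{\beta \in \mathcal{B}_0} {}^{w}(h_{\beta(i)}, \varepsilon)_n \to 0$ for any $i = 1, \ldots, p$,

from which we can derive the property (27) with the same technique as in the proof of Lemma 1.1.2.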

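Further, we can show that ${}^{u}(h_\beta, h_\beta')$ and ${}^{u}(h_\beta, f)$ are continuous in $\beta \in \mathcal{B} = H \times \Gamma$. This immediately yields that $\alpha_\beta := {}^{u}(h_\beta, h_\beta')^{-1}\, {}^{u}(h_\beta, f)$ is continuous in $\beta \in \mathcal{B} = H \times \Gamma$. This statement can be shown only for $\mathcal{B}$, but not for $\mathcal{B}_0$, since ${}^{u}(h_\beta, h_\beta')^{-1}$ does not exist for $\beta = (\eta, \gamma)$ with $\gamma \in \Gamma_0 \setminus \Gamma$. But we may introduce $\alpha_\beta = {}^{u}(h_\beta, h_\beta')^{+}\, {}^{u}(h_\beta, f)$ and $\hat\alpha_\beta = {}^{w}(h_\beta, h_\beta')_n^{+}\, {}^{w}(h_\beta, y)_n$. Thereby $\hat\alpha_\beta = \hat\alpha_{(\eta, m(\gamma))}$ holds. If $\gamma_{i-1} = \gamma_i$ is valid, $\alpha_i(\beta) = \hat\alpha_i(\beta) = 0$ follows for the $i$th subvectors of $\alpha_\beta$ and $\hat\alpha_\beta$, respectively, so that

$\sup_{\beta \in \mathcal{B}_0} \|\hat\alpha_\beta - \alpha_\beta\| = \sup_{\beta \in \mathcal{B}} \|\hat\alpha_\beta - \alpha_\beta\| \to 0$

can be shown with the same technique as in Lemma 1.1.3. The continuity of $\alpha_\beta$ implies the continuity of $d(\beta) := {}^{u}|f - \alpha_\beta' h_\beta|$ on $\mathcal{B}^\delta$, so that there exists a $\beta^\delta = (\eta^\delta, \gamma^\delta) \in \mathcal{B}^\delta = H \times \Gamma^\delta$ with $d(\beta^\delta) = \min_{\beta \in \mathcal{B}^\delta} d(\beta)$. Assumption 3 implies

$F_n(\gamma_i^\delta) - F_n(\gamma_{i-1}^\delta) \to F(\gamma_i^\delta) - F(\gamma_{i-1}^\delta) > 0$

and thus it follows that $\gamma^\delta \in \Gamma_n^\delta$ and $Q_n(\hat\vartheta_n^\delta) \le Q_n(\vartheta^\delta)$ for almost all $n$. With $\sup_{\beta \in \mathcal{B}^\delta} \|\alpha_\beta\| < \infty$ and the same arguments as in the proof of Theorem 1.1.1, the statements 1 (a) and 2 (a) follow. If there exists a $\vartheta^* \in \Theta$ with $\delta^* := \delta(\vartheta^*) > 0$ and ${}^{u}|f - g_{\vartheta^*}| = \min_{\vartheta \in \Theta} {}^{u}|f - g_\vartheta|$, it immediately follows that $\Delta_\delta := \min_{\vartheta \in \Theta^\delta} {}^{u}|f - g_\vartheta| = \Delta_0$ for all $\delta \le \delta^*$, and consequently the statements 1 (a) and 2 (a) imply the statements 1 (b) and 2 (b). The remaining statements are consequences of 1 and 2.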


If the assumption 5 is replaced by the assumption 5', then the continuity of $\alpha_\beta := [{}^{u}(h_\beta, h_\beta') + A'(\beta) A(\beta)]^{-1}\, {}^{u}(h_\beta, f)$ in $\beta \in \mathcal{B}$ follows from the continuity of ${}^{u}(h_\beta, h_\beta')$ and ${}^{u}(h_\beta, f)$. With the same technique of proof as in Theorem 1.2.1, the following theorem of consistency of the WLSE $\hat\vartheta_n^\delta$ in the case of continuous changes of state can be shown.


Theorem 1.2.2 Under the assumptions 1 to 4, 5', 6 to 8 and $x = z$, the statements from Theorem 1.2.1 hold.


Remark 1.2.3 Since the continuity of $\alpha_\beta$ could only be shown for $\beta \in \mathcal{B} = H \times \Gamma$, but not for $\beta \in \mathcal{B}_0 := H \times \Gamma_0$, we do not succeed in showing $\sup_{\beta \in \mathcal{B}} \|\alpha_\beta\| < \infty$. Thus we cannot prove the consistency of $\hat\vartheta_n = \hat\vartheta_n^0$ with the applied technique. But for the special case $r = 2$ we have $\Gamma = \Gamma_0 = \Gamma^\delta$ with $\delta = (c - a, b - d)'$ and thus the consistency statements of Theorems 1.2.1 and 1.2.2 are valid for $\hat\vartheta_n = \hat\vartheta_n^0$, too.

In case the matrix ${}^{u}(h_\beta, h_\beta') + A'(\beta) A(\beta)$ is nonsingular even for all $\beta \in \mathcal{B}_0$ and if the identifiability condition 8 holds not only on $\Theta$, but also on $\Theta_0$, the consistency statements are also valid for $\hat\vartheta_n = \hat\vartheta_n^0$ in the model with continuous changes of state and an arbitrary state number $r$.


Remark 1.2.4 Provided that the linear parameters $\alpha_i$ vary in compact sets $\mathcal{A}_i$, $i = 1, \ldots, r$, we can similarly show that $\hat\vartheta_n^\delta$ and $\hat\vartheta_n = \hat\vartheta_n^0$ satisfy the consistency statements of Theorems 1.2.1 and 1.2.2.

For models with $r$ continuously connected linear state functions, Feder (1975a) investigated sufficient conditions for the weak consistency of the OLSE $\hat\vartheta_n$. He did not use the minimal state lengths $\delta_i$, but established a condition on the sequence of experimental designs $\{\xi_n\}$ which ensures that, with a probability converging to 1, the OLSE $\hat\vartheta_n$ lies in a compact set containing the 'true' parameter $\vartheta^0$ (assumption (*) and lemma 3.4 in Feder, 1975a). Moreover, he investigated the speed of convergence of the OLSE and derived sufficient conditions for the asymptotic normality of $\hat\vartheta_n$ (1975a, theorems 4.13 and 4.17). From his results it follows that in the model with two intersecting straight lines and a homogeneous variance $n^{1/2}(\hat\gamma_n - \gamma^0)$ is asymptotically normally distributed. According to Hinkley (1969), empirical studies showed that for finite sample size $n$ the normal distribution provides a poor approximation to the distribution of $\hat\gamma_n$. Therefore he suggested alternative approximations.

1.2.5 Some other models with state switching


In the short description of other models considered in the literature we can restrict ourselves to models with $r = 2$ linear state functions $\alpha_1' h_1(x)$ and $\alpha_2' h_2(x)$ with normally distributed observation errors. Goldfeld and Quandt (1972) treat a model with $z_t = k's_t$, where $k$ is an unknown vector and $s_t$ an observable or nonobservable variable. In particular, $s_t = x_t$ may hold. Since $z_t = k's_t$ is not known, we cannot turn to an ordered model because we do not know which observations belong to which state. These assumptions lead to a model where more unknown parameters than observations occur, so that the parameters are not identifiable. That is why Goldfeld and Quandt (1972) suggest an approximate maximum likelihood estimate (D-method) of the state parameters $\alpha_1, \alpha_2, \sigma_1^2, \sigma_2^2$ and of the parameters $k$ and $\gamma$ describing the mechanism of assignment between states and observations. The method is based on a reduction of the number of the unknown parameters.

Besides models with a deterministic character of assigning the states and observations, Quandt (1972), Goldfeld and Quandt (1973) and Quandt and Ramsey (1978) consider models with a stochastic mechanism of assignment. In the observation model with two states they introduce the state probabilities $\lambda_{t1}$ and $\lambda_{t2} = 1 - \lambda_{t1}$, $t = 0, \ldots, n$, where $\lambda_{t1}$ denotes the probability that the system under consideration is in the first state at the time $t$ (i.e. $y_t = \alpha_1' h_1(x_t) + \varepsilon_t$, $D\varepsilon_t = \sigma_1^2$ holds), and $\lambda_{t2}$ denotes the corresponding probability for the second state.

Two special cases are investigated:

(a) $\lambda_{t1} = \lambda$ for all $t = 0, 1, \ldots$

(b) $\lambda_t = \begin{pmatrix} \lambda_{t1} \\ \lambda_{t2} \end{pmatrix} = \begin{pmatrix} \tau_1 & 1-\tau_2 \\ 1-\tau_1 & \tau_2 \end{pmatrix} \begin{pmatrix} \lambda_{t-1,1} \\ \lambda_{t-1,2} \end{pmatrix}, \quad t = 1, 2, \ldots$

The matrix $T = \begin{pmatrix} \tau_1 & 1-\tau_1 \\ 1-\tau_2 & \tau_2 \end{pmatrix}$ may be interpreted as a Markovian transition matrix, where $1 - \tau_1$ denotes the probability that the model changes from state 1 at the time $t-1$ to state 2 at the time $t$ and $1 - \tau_2$ analogously denotes the transition probability from the state 2 to the state 1. In order to estimate the state parameters $\alpha_1, \alpha_2, \sigma_1^2, \sigma_2^2$ and the probabilities $\lambda$ and $\lambda_{01}$, $\tau_1$ and $\tau_2$, respectively, characterizing the state mechanism, Goldfeld and Quandt (1972, 1973) propose maximum likelihood estimators. For the special case (a) Quandt and Ramsey (1978) derive a further estimate based on the moment generating function. Under certain regularity assumptions this estimator is consistent and asymptotically normally distributed. Bayesian methods for estimating the state parameters and the change index $m$ in a model with two straight lines as state functions when using an improper a priori distribution were treated by Ferreira (1975) and Holbert and Broemeling (1977). An approximate Bayesian method to estimate the change point in the model with two continuously connected straight lines was introduced by Bacon and Watts (1971).
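The recursion for the state probabilities in the Markovian case (b) can be illustrated by a few lines of Python. The sketch only propagates the probabilities for given $\tau_1$, $\tau_2$ and a given initial probability; it does not estimate anything, and the function name and argument names are hypothetical.

```python
def propagate_state_probabilities(lam0, tau1, tau2, steps):
    """Iterate (lam_t1, lam_t2) for t = 0, ..., steps.

    lam0: probability of state 1 at time 0
    tau1: probability of staying in state 1, tau2: of staying in state 2
    """
    for name, value in (("lam0", lam0), ("tau1", tau1), ("tau2", tau2)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(name + " must lie in [0, 1]")
    lam1, lam2 = lam0, 1.0 - lam0
    path = [(lam1, lam2)]
    for _ in range(steps):
        # state 1 is reached by staying in 1 or by leaving 2, and vice versa
        lam1, lam2 = (tau1 * lam1 + (1.0 - tau2) * lam2,
                      (1.0 - tau1) * lam1 + tau2 * lam2)
        path.append((lam1, lam2))
    return path
```

If both transition probabilities $1-\tau_1$ and $1-\tau_2$ are positive, starting from $\lambda_{01} = (1-\tau_2)/((1-\tau_1)+(1-\tau_2))$ leaves the probabilities unchanged over time.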


1.2.6 Methods of identification of state switching in models 
with unknown number of states 


Now we consider the model (1) with linear state functions $\alpha_i' h(x)$, where the changes of state take place in dependence on $x$ (i.e. $z = k(x)$) and where the number of states $r$ is unknown. The same considerations as in Section 1.2.2 lead to the observation model

$y_t = \alpha_i' h(x_t) + \varepsilon_t, \quad D\varepsilon_t = \sigma_i^2, \quad t \in A_i, \quad i = 1, \ldots, r,$

with

$A_i = \{m_{i-1} + 1, m_{i-1} + 2, \ldots, m_i\}, \quad i = 1, \ldots, r,$

unknown change indices

$m_i = \max_{1 \le t \le n} \{t \mid z_t \le \gamma_i\}, \quad i = 1, \ldots, r-1,$

and an unknown number of states $r$ from a given set $\mathcal{R}$.
Finding an estimate $\hat r$ by the least squares method as a solution of

$S(\hat r) = \min_{r \in \mathcal{R}} S(r)$ with $S(r) := S(\hat\vartheta(r)) = \min_{\vartheta \in \Theta_r^\xi} \sum_{t=1}^{n} (y_t - g_\vartheta(x_t))^2$

demands a considerable computational effort. Here $\Theta_r^\xi$ denotes the parameter space chosen for the experimental design $\xi$ and $r$ states. For example, in a model with $n = 20$ observations, $p_1 = 2$-dimensional state parameters $\alpha_i$ and the assumption that there are at least three observations within each state, as many as 406 residual sums $S(\hat\vartheta) = \sum_{t=1}^{n} (y_t - g_{\hat\vartheta}(x_t))^2$ have to be compared. In the case of $n = 25$ observations, the number of residual sums to be computed increases to 2745.
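The two figures can be reproduced by counting the ways of splitting $n$ ordered observations into consecutive states with at least three observations each, the number of states being free. This counting convention is an assumption made here because it yields exactly the quoted values; the function name is hypothetical.

```python
def count_segmentations(n, min_len):
    """Number of ways to split n ordered observations into consecutive
    states (any number of states) with at least min_len observations each."""
    counts = [0] * (n + 1)
    counts[0] = 1  # the empty prefix can be segmented in exactly one way
    for total in range(1, n + 1):
        # condition on the length of the last state
        counts[total] = sum(counts[total - last]
                            for last in range(min_len, total + 1))
    return counts[n]
```

The count grows roughly geometrically in $n$, which illustrates why an exhaustive comparison of all residual sums quickly becomes expensive.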

Obviously $S(r)$ is a decreasing function in $r$, so that independent of the true number of states an OLSE leads to a large value $\hat r$.

McGee and Carleton (1970) and Schulze (1977b) suggested other empirical
methods to estimate the number of states $r$, the change indices, and the state
parameters $\alpha_j$, $\sigma_j^2$. These methods are mainly based on the use of the test (11)
developed in Section 1.2.2.3 for the comparison of the states in two groups of
observations. This test is applied to decide whether a given observation group
and its adjacent group can be combined or not. Of course the methods depend
on the chosen level of significance $\alpha$ of the tests used. The greater $\alpha$ is chosen,
the sooner the hypothesis about the equality of the states in the considered
observation groups is rejected and the more states are found. Theoretical
statements on the power of such empirical methods do not exist and are also
difficult to obtain because of the complicated structure of the methods. A
comparison of the various suggested methods based on empirical investigations
is still missing.

Another method to estimate $r$, based on Mallows' $C_p$ statistic, was suggested
by Ertel and Fowlkes (1976). Halpern (1973) considered a model with an unknown
number of states $r$ and continuously connected linear state functions. He
describes a Bayesian method to identify changes of state. The most restric- 
tive assumption of his model is that the changes of state may only occur in 
finitely many given points 9,, ..., p,, where it is not known in how many and 
which of these points. Starting from prior distributions in an appropriate 
reparametrized model, he derives conditional posterior distributions for the 
state parameters given the change point y and the posterior distribution of the 
change points. Besides the estimation of the change point y by maximization 
of the posterior distribution, he investigates the problem of optimal prediction. 


1.3 Some topics in nonparametric regression 


1.3.1 Introduction 


In this section we review some aspects and recent results in the theory of non- 
parametric estimation of regression functions. Consider observations 


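$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1}$$

where $\varepsilon_1, \ldots, \varepsilon_n$ are independent identically distributed random variables with
$E\varepsilon_i = 0$, $E\varepsilon_i^2 = 1$, $i = 1, \ldots, n$.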


The regression function $f$ is measured at the design points $x_i$, which belong to
some interval $\mathcal{X}$ of the real line. In the preceding sections it was mostly assumed
that the regression function $f$ was in a class $\{f_\vartheta \mid \vartheta \in \Theta\}$, where $\Theta$ was a subset
of some finite-dimensional space. Even though the actual model was allowed
to deviate from this class, estimation methods were parametric, i.e., amounted to
estimating a parameter of fixed finite dimension. If the parameter space is
assumed to be infinite dimensional, then the model is said to be nonparametric,
abbreviated NP hereafter.

The theory of NP regression is intimately connected with other models of 
curve estimation, notably with the estimation of probability densities. We will 
neither explore all these interconnections nor attempt to survey the field of NP 
regression as a whole. We will focus on the decision-theoretic aspects of esti- 
mation. In particular, we will be concerned with asymptotic optimality of NP 
regression estimators. For small-sample results on minimax and Bayes estimators
of the regression functions we refer to the papers of Bunke (1985) and
Van der Linde (1985).

Let us now introduce some further model assumptions and notation. We
will regard $f$ itself as the parameter to be estimated, and suppose

$$f \in \mathcal{F},$$

where $\mathcal{F}$ is some subset of a function space. The set $\mathcal{X}$ is supposed to be a
compact interval in $\mathbb{R}$; let $\mathcal{X} = [0, 1]$ hereafter.

Let $y = (y_1, \ldots, y_n)'$ be the data vector. The set $\xi = \{x_1, \ldots, x_n\}$ is referred
to as the regression design, and it is assumed that $x_1 \le x_2 \le \ldots \le x_n$. Let
$f^\xi = (f(x_i))_{i=1,\ldots,n}$. In general we will be concerned with the case where $\xi$ is
nonrandom, given, and becomes dense in $[0, 1]$ as $n \to \infty$, in a sense to be
specified later. Some results to be reviewed concern experimental planning
in the present context, where $\xi$ can be selected prior to estimation.

In the asymptotic framework we will admit design sequences such that
the design of size $n$ is not a subset of the design of size $n + 1$. A specific role
will be played by the sequence of uniform designs $\{(i - 1)/(n - 1),\ i = 1, \ldots, n\}$.
For all notation we adopt the convention that the dependence subscript $n$ can
be dropped.

A sizeable part of the literature on NP regression deals with the case where
$(x_i, y_i)$ are independent random pairs distributed like $(X, Y)$, and where
$f(x) = E(Y \mid X = x)$. We remark only that with regard to asymptotic optimality
of estimators, results mostly parallel those for nonrandom $\xi$.

The particularities caused by the infinite dimensionality of $\mathcal{F}$, which we
are primarily interested in, persist if the error variables $\varepsilon_i$ follow a simple
statistical model. Thus we will always assume $\varepsilon_i \sim N(0, 1)$, $i = 1, \ldots, n$.

With regard to the choice of the loss function, a distinction can be made
between estimating $f$ at a point and estimating $f$ globally. If one is interested
in the value of $f$ at a given point $x$, the loss will be, e.g.,

$$|\hat f(x) - f(x)|^2$$

for an estimator $\hat f$. For global estimation one considers a norm $\|\cdot\|$ of some function
space and a loss which is a function of $\|\hat f - f\|$. The norm $\|\cdot\|$ will be
selected among the norms of $L_p$ type, $1 \le p \le \infty$. Let

$$\|f\|_p = \left( \int_0^1 |f(x)|^p \, dx \right)^{1/p}, \quad 1 \le p < \infty, \qquad \|f\|_\infty = \operatorname*{ess\,sup}_{x \in [0,1]} |f(x)|$$

(where $\operatorname*{ess\,sup}_{x \in A} f(x) = \inf \{ \sup_{x \in A \setminus M} f(x) \mid \mu(M) = 0 \}$).

$L_p$ is the associated Banach space of (equivalence classes of) real functions on
$[0, 1]$.


As an estimator of $f$ for sample size $n$ we will admit any measurable function
$\hat f: [0, 1] \times \mathbb{R}^n \to \mathbb{R}$. The risk will be defined (temporarily) as

$$R_n(\hat f, f) = E_f \|\hat f - f\|_p^2$$

or, alternatively, by substituting $|\hat f(x) - f(x)|$ for $\|\hat f - f\|_p$ above. Consider the
supremal risk of an estimator

$$\varrho_n(\hat f, \mathcal{F}) = \sup_{f \in \mathcal{F}} R_n(\hat f, f)$$

and the minimax risk at stage $n$,

$$\Delta_n(\mathcal{F}) = \inf_{\hat f} \varrho_n(\hat f, \mathcal{F}). \tag{2}$$


A sequence $\{\hat f_n\}$ such that $\varrho_n(\hat f_n, \mathcal{F})$ will be close to $\Delta_n(\mathcal{F})$ in some sense, for
$n \to \infty$, is called an asymptotically minimax (AM) estimator. The concept of
AM optimality has proved its usefulness in parametric and NP statistical 
models as well as in robustness theory. One of its principal merits is that it 
allows the treatment of optimality in the class of all estimators, dispensing 
with restrictions such as asymptotic normality. For parametric statistical 
models, the classical local AM theorem of Hajek (1973) gave rise to an extensive 
development. Such local AM optimality statements can also be obtained for 
parametric nonlinear regression; in fact, this topic is covered in part by general 
results of Ibragimov and Khasminski (1981). 

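A result of this type, in the parametric case, would be of the following form.
Suppose $\mathcal{F} = \{f_\vartheta \mid \vartheta \in \Theta\}$, where $\Theta$ is an open subset of $\mathbb{R}^k$. We assume that
the closure $\bar\Theta$ is compact, and that $\vartheta_j$, $j = 1, \ldots, k$, are the Fourier coefficients
of $f$ with respect to some orthonormal basis $\{\varphi_j\}_{j \in \mathbb{N}}$ of $L_2$. Suppose that $\xi$ is the
uniform design of size $n$, and that $\hat\vartheta_n$ is the MLE of $\vartheta$. Under standard regularity
conditions one has

$$\mathcal{L}(n^{1/2}(\hat\vartheta_n - \vartheta) \mid \vartheta) \Rightarrow N(0, V_\vartheta), \quad \vartheta \in \Theta, \tag{3}$$

where $V_\vartheta$ is the asymptotic covariance matrix. Typically $V_\vartheta$ is of the form

$$V_\vartheta^{-1} = \int_0^1 \frac{\partial f_\vartheta(x)}{\partial \vartheta} \left( \frac{\partial f_\vartheta(x)}{\partial \vartheta} \right)' dx.$$

Since the $\vartheta_j$ are the Fourier coefficients,

$$\frac{\partial f_\vartheta(x)}{\partial \vartheta_j} = \varphi_j(x), \quad \text{so that} \quad V_\vartheta = I, \quad \vartheta \in \Theta.$$

Under some additional regularity conditions, (3) implies

$$n E_\vartheta \|\hat\vartheta_n - \vartheta\|^2 \to \operatorname{tr} V_\vartheta = k. \tag{4}$$

Also, since $\{\varphi_j\}$ is an orthonormal basis, for $\hat f_n = f_{\hat\vartheta_n}$,

$$\|\hat f_n - f_\vartheta\|_2^2 = \|\hat\vartheta_n - \vartheta\|^2. \tag{5}$$

Since $\bar\Theta$ is compact, it is also a typical result that the convergence (4) holds
uniformly over $\vartheta \in \Theta$. Then (4) and (5) imply (for $p = 2$)

$$n \varrho_n(\hat f_n, \mathcal{F}) \to k.$$

The corresponding version of Hajek’s AM theorem, stating asymptotic efficiency
of $\hat f_n$, would be

$$\liminf_n n \Delta_n(\mathcal{F}) \ge k.$$

The latter two relations describe the solution of the asymptotic efficiency
problem in this simple parametric case:

$$\lim_n n \Delta_n(\mathcal{F}) = k. \tag{6}$$

This equality, which can be written $\Delta_n(\mathcal{F}) = n^{-1} k (1 + o(1))$, $n \to \infty$, specifies
$n^{-1}$ as the rate of convergence of the minimax risk to zero. But it contains more,
viz. also the constant $k$ describing the exact asymptotics.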
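
The limit (4) can be illustrated numerically in the simplest case. The sketch below is only an illustration under stated assumptions: it takes the first $k$ functions of the cosine basis of $L_2[0, 1]$ as the orthonormal system, uses the uniform design $\{(i - 1)/(n - 1)\}$ and standard normal errors, so that the MLE coincides with the least squares estimator and $E\|\hat\vartheta_n - \vartheta\|^2$ equals the trace of $(X'X)^{-1}$. The function names are hypothetical.

```python
import numpy as np


def cosine_basis(x, k):
    """Design matrix of the first k orthonormal cosine functions on [0, 1]."""
    cols = [np.ones_like(x)]
    for j in range(1, k):
        cols.append(np.sqrt(2.0) * np.cos(np.pi * j * x))
    return np.column_stack(cols)


def scaled_lse_risk(n, k):
    """n * E||theta_hat - theta||^2 for least squares with N(0, 1) errors.

    For a linear model with unit error variance the covariance matrix of
    the least squares estimator is (X'X)^{-1}, so the risk is its trace.
    """
    x = np.linspace(0.0, 1.0, n)          # uniform design (i - 1)/(n - 1)
    X = cosine_basis(x, k)
    return n * np.trace(np.linalg.inv(X.T @ X))


for n in (50, 500, 5000):
    print(n, scaled_lse_risk(n, k=5))     # approaches k = 5 as n grows
```

The scaled risk stays close to the number of parameters $k$, which is the constant appearing in (6).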

Consider now a nonparametric model. An infinite-dimensional set $\mathcal{F}$ which
is sufficiently rich will, for each $k$, contain a set $\mathcal{F}_k$ isomorphic to a compact
subset of $\mathbb{R}^k$. It follows that

$$n \Delta_n(\mathcal{F}) \to \infty, \quad n \to \infty. \tag{7}$$

Consequently, $n^{-1}$ is no longer the rate of convergence of the minimax risk.
Note that the relation (7), which was derived for squared $L_2$-risk ($p = 2$), holds
also for $1 \le p \le \infty$ and for the risk at a point. Now, to solve the asymptotic
efficiency problem in the nonparametric case, the behaviour of $\Delta_n(\mathcal{F})$ has to be
specified. A first question is whether $\Delta_n(\mathcal{F}) \to 0$. For smoothness classes $\mathcal{F}$
such as

$$W(1, 2, L) = \{f \mid f \text{ absol. continuous},\ \|f\|_2^2 + \|f'\|_2^2 \le L\}$$


this is the case, as will be clarified later. To see which classes $\mathcal{F}$ can
basically qualify, we refer to results of Ibragimov and Khasminski (1980a).
These authors studied a continuous time analogue of NP regression (see (25)
below), and found that for $\Delta_n(\mathcal{F}) \to 0$ (in the case $p = 2$) it is necessary
that $\mathcal{F}$ has some compactness property in the space $L_2$. Classes like $W(1, 2, L)$
are compact in this sense for $L < \infty$, while the class of all differentiable $f$
and also the unit ball in $L_2$ are not. In fact, these results concern the
existence of uniformly consistent estimators of $f$:

$$\sup_{f \in \mathcal{F}} P_f(\|\hat f_n - f\|_2 > \varepsilon) = o(1), \quad n \to \infty, \quad \forall \varepsilon > 0.$$

In parametric theory, the role of compactness conditions with uniform consistency
results is well known.

Once $\Delta_n(\mathcal{F}) = o(1)$ has been clarified, interest centres on the rate of convergence
to zero. For describing rates of convergence, we introduce the following
notation. Two sequences $\{a_n\}$, $\{b_n\}$ are weakly equivalent, or $a_n \asymp b_n$, if

$$C_1 \le a_n / b_n \le C_2 \quad \text{for } n \ge C_3,$$

for some positive constants $C_i$. Also $b_n$ is called a rate of convergence of $a_n$.
We note that there are results in NP regression where $n^{-1}$ is the rate of convergence
of the risk. This is the case, for instance, if the loss is defined in terms
of the function $F(x) = \int_0^x f(t)\, dt$, as e.g. $\|\hat F - F\|_2^2$. A candidate for an optimal
estimator is the stochastic process

$$Y(x) = n^{-1} \sum_{j \le nx} y_j, \quad x \in [0, 1],$$

since $n^{1/2}(Y(x) - F(x))$ has a limiting distribution (at least if $\xi$ is uniform).
The problem is very similar to that of estimating a distribution function (see
e.g. Millar, 1979), and the theory of limits of experiments applies in analogy
to the parametric case. Millar (1982) treated the NP regression model for a
related loss function, where also $n^{1/2}$-consistency applies. Earlier related
references are Makowski (1974) and Beran (1982). Observe that the norms
$\|f\|_p$ are stronger than $\|F(\cdot)\|_p$. Though the operation $f \to F$ is continuous
in $L_p$, inverting it would be incorrect in a statistical context since the inverse
is not continuous. This sheds some more light on the slower rate of convergence
for the norm $\|\cdot\|_p$.

The latter consideration also reveals a relationship to the theory of ‘ill- 
posed’ or ‘incorrect’ problems in analysis. We mention the work of Fedotov 
(1981, 1982) exploring this aspect of statistical curve estimation. 


1.3.2 Optimal rates of convergence 


At this point it is appropriate to mention the connection of the present topic 
with the theory of the approximation of functions. In this branch of mathe- 
matics, many methods have been developed for the task of recovering a function 
$f$ from data $f(x_i)$, $i = 1, \ldots, n$, or from noisy data. Among these are linear interpolation
methods, which basically assume non-noisy data. In the NP regression
model, linear interpolation algorithms are of interest in the case of replicated 
observations, where the error in each point of observation can be made small. 
If the design is not a replicated one, linear smoothing methods are appropriate. 
Consider a family of linear operators

$$\{S_{n,r},\ n \in \mathbb{N},\ r > 0\}$$

where each $S_{n,r}$ assigns a function on $[0, 1]$ to a data vector $y \in \mathbb{R}^n$. Here $r$ is
a real parameter which describes the degree of smoothing, such that large $r$
corresponds to little smoothing. For instance, $S_{n,r}$ could be a kernel estimator
with bandwidth $r^{-1}$, or a truncated Fourier series of length $r$ fitted to the data
$y$. More precisely, if $\{\varphi_j\}_{j \in \mathbb{N}}$ is an orthonormal system in $L_2$, $f_j = (\varphi_j, f)$, one can
form empirical Fourier coefficients

$$\tilde f_j(y) = n^{-1} \sum_{i=1}^{n} \varphi_j(x_i) y_i$$

and set

$$S_{n,r}(y) = \sum_{j=1}^{r} \tilde f_j(y) \varphi_j.$$
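
A truncated Fourier series estimator of this kind can be sketched in a few lines. The sketch assumes that the empirical Fourier coefficients are the sample averages $n^{-1} \sum_i \varphi_j(x_i) y_i$, takes the cosine system on $[0, 1]$ as the orthonormal system, and uses a midpoint design $x_i = (i - 1/2)/n$, for which this system is exactly orthonormal in the discrete sense. These choices and all function names are assumptions made for illustration.

```python
import numpy as np


def phi(j, x):
    """Orthonormal cosine system on [0, 1]; phi(1, .) is the constant 1."""
    x = np.asarray(x, dtype=float)
    if j == 1:
        return np.ones_like(x)
    return np.sqrt(2.0) * np.cos(np.pi * (j - 1) * x)


def empirical_fourier_coefficients(x, y, r):
    """Sample averages n^{-1} * sum_i phi_j(x_i) * y_i for j = 1, ..., r."""
    return np.array([np.mean(phi(j, x) * y) for j in range(1, r + 1)])


def projection_estimate(x, y, r, grid):
    """Truncated series of length r evaluated on the points in grid."""
    coef = empirical_fourier_coefficients(x, y, r)
    return sum(c * phi(j, grid) for j, c in enumerate(coef, start=1))


# Usage with simulated data (fixed seed for reproducibility).
rng = np.random.default_rng(0)
n = 400
x = (np.arange(n) + 0.5) / n
y = np.cos(np.pi * x) + rng.standard_normal(n)
grid = np.linspace(0.0, 1.0, 5)
print(projection_estimate(x, y, r=4, grid=grid))
```

Increasing `r` lets the series follow the data more closely, which corresponds to less smoothing.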


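The risk evaluation for such estimators is analogous to the one for orthogonal
series estimators of densities. Usually, one knows from approximation
theory that, for $f \in \mathcal{F}$, the operator $S_{n,r}(f^\xi)$ reproduces $f$ with a certain accuracy:

$$\|f - S_{n,r}(f^\xi)\|_2 = O(r^{-\beta}), \quad n, r \to \infty, \quad \beta > 0,$$

uniformly over $\mathcal{F}$ if $r = o(n)$. Obviously,

$$E_f S_{n,r}(y)(x) = S_{n,r}(f^\xi)(x), \quad x \in [0, 1].$$

On the other hand, we have

$$\operatorname{Var} \tilde f_j(y) = n^{-2} \sum_{i=1}^{n} \varphi_j^2(x_i).$$

Under some conditions on the basis and $\xi$, $n^{-1} \sum_{i=1}^{n} \varphi_j^2(x_i)$ will be close to one; then

$$\operatorname{Var} \tilde f_j(y) = n^{-1}(1 + o(1)).$$

Now, for the risk of the estimator $\hat f = S_{n,r}(y)$ one obtains

$$E_f \|\hat f - f\|_2^2 = \|f - S_{n,r}(f^\xi)\|_2^2 + E_f \|S_{n,r}(y - f^\xi)\|_2^2 \le O(r^{-2\beta}) + \sum_{j=1}^{r} \operatorname{Var} \tilde f_j(y) = O(r^{-2\beta}) + r n^{-1}(1 + o(1)).$$

We see that the last expression decreases with maximal speed if $r$ is chosen such
that $r^{-2\beta} \asymp r n^{-1}$, i.e. if $r \asymp n^{1/(2\beta+1)}$. Then

$$E_f \|\hat f - f\|_2^2 = O(n^{-2\beta/(2\beta+1)})$$

uniformly over $f \in \mathcal{F}$, so that

$$\varrho_n(\hat f, \mathcal{F}) = O(n^{-2\beta/(2\beta+1)}). \tag{8}$$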
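
The tradeoff behind (8) can be made concrete with a small numerical experiment. The sketch below minimizes the bound $r^{-2\beta} + r n^{-1}$ over integer $r$, with both implied constants set to one (an assumption made purely for illustration), and compares the minimizer with $n^{1/(2\beta+1)}$ and the minimum with $n^{-2\beta/(2\beta+1)}$. The function names are hypothetical.

```python
import numpy as np


def risk_bound(r, n, beta):
    """Squared bias r^(-2*beta) plus variance r/n, all constants set to 1."""
    return r ** (-2.0 * beta) + r / n


def best_series_length(n, beta):
    """Integer r in 1..n minimizing the bound."""
    r = np.arange(1, n + 1, dtype=float)
    return int(r[np.argmin(risk_bound(r, n, beta))])


beta = 2
for n in (10**3, 10**4, 10**5):
    r_opt = best_series_length(n, beta)
    rate = n ** (-2.0 * beta / (2.0 * beta + 1.0))
    # The ratio in the last column stays bounded as n grows.
    print(n, r_opt, n ** (1.0 / (2 * beta + 1)), risk_bound(r_opt, n, beta) / rate)
```

Setting the derivative of the bound to zero gives the real-valued minimizer $r = (2\beta n)^{1/(2\beta+1)}$, which is of the order $n^{1/(2\beta+1)}$ as stated above.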


All rate of convergence results for linear estimators in NP regression are basically
obtained in this way, i.e., by separate evaluation of the bias and variance
parts of the risk and a tradeoff between them. This applies also to loss functions
$\|\hat f - f\|_p^q$, $1 \le p \le \infty$, $q \ge 1$. In the most general case one considers a risk

$$E_f\, l(\delta_n \|\hat f - f\|_p)$$

where $l: \mathbb{R}_+ \to \mathbb{R}_+$ is a monotone function, and $\delta_n \to \infty$ is a norming sequence.
The complement to the result (8) would be that the rate $n^{-2\beta/(2\beta+1)}$ is the
best possible:

$$\Delta_n(\mathcal{F}) \asymp n^{-2\beta/(2\beta+1)}. \tag{9}$$


Estimates of the minimax risk $\Delta_n(\mathcal{F})$ from below are usually derived from the
Bayes risk for an appropriate sequence of prior distributions. Many variants
using discrete or continuous priors have been applied in the literature; we cite
the result of Ibragimov and Khasminski (abbreviated IKh hereafter) (1980b,
1982a), which concerns the global $L_p$-risk.

Suppose the bound (9) is proven; it is valid for a particular sequence of
designs $\{\xi\}$. The question arises whether the optimal rate depends on the particular
design sequence, and whether the rate (9) can possibly be improved by
experimental planning. IKh answered this in the negative (for global loss)
by proving a result like (9) for a whole class of designs, including randomized
and sequential ones. To define this class, let $(\Omega, \mathcal{A}, P)$ be some basic probability
space, and let $\mathcal{A}_n \subset \mathcal{A}$ be the $\sigma$-algebra generated by $(x_1, y_1), \ldots, (x_n, y_n)$.
If $(x_1, y_1), \ldots, (x_{j-1}, y_{j-1})$ are given, the next design point $x_j$ is chosen by
randomization according to a law

$$\mathcal{L}(x_j \mid \mathcal{A}_{j-1}). \tag{10}$$

Since that choice is to be made by the statistician, it is natural to require that
$\mathcal{L}_j$ does not depend on the unknown $f$. Any array

$$\Pi_n = (\mathcal{L}_1(\cdot \mid \cdot), \ldots, \mathcal{L}_n(\cdot \mid \cdot)),$$

where the $\mathcal{L}_j(\cdot \mid \cdot)$ are of form (10), will be called an admissible design; their
entirety will be denoted by $\mathfrak{M}_n$. For Theorem 1.3.1 following, let the previous
assumptions on (1) be modified accordingly.

Let us introduce some classes of differentiable functions. For any natural $\beta$
set $C(\beta, L) = \{f \mid f^{(\beta)} \text{ exists, is continuous},\ \|f\|_\infty + \|f^{(\beta)}\|_\infty \le L\}$. For the lower
risk bound it is convenient to consider the subclass $C_0(\beta, L)$ of those $f$ in $C(\beta, L)$
which have support contained in $(0, 1)$.

It is also of interest to consider infinitely differentiable or analytic functions.
Let $A(\beta, L)$ be the class of functions on $\mathbb{R}$ which are periodic with period 1,
and admit an extension to the complex domain $|\operatorname{Im} z| < \beta$ such that they are
analytic and bounded in modulus in this domain by $L$. By $A(\beta, L)$ we shall
also denote the class of restrictions of such functions to the interval $[0, 1]$.

Theorem 1.3.1 (IKh, 1980b, 1982a). Let $l: \mathbb{R}_+ \to \mathbb{R}_+$ be a monotone function
such that $l(0) = 0$. For a given function class $\mathcal{F}$, a sequence $\{\delta_n\}$ and $p$, $1 \le p \le \infty$,
define

$$\tau = \liminf_n \inf_{\hat f} \inf_{\Pi_n \in \mathfrak{M}_n} \sup_{f \in \mathcal{F}} E_f\, l(\delta_n \|\hat f - f\|_p). \tag{11}$$

In each of the cases (a)-(d) below, there is a constant $C > 0$ not depending on
$L$ such that $\tau \ge l(C)/2$.

(a) $\mathcal{F} = C_0(\beta, L)$, $1 \le p < \infty$, $\delta_n = n^{\beta/(2\beta+1)}$

(b) $\mathcal{F}$ as in (a), $p = \infty$, $\delta_n = (n/\ln n)^{\beta/(2\beta+1)}$

(c) $\mathcal{F} = A(\beta, L)$, $1 \le p < \infty$, $\delta_n = (n/\ln n)^{1/2}$

(d) $\mathcal{F}$ as in (c), $p = \infty$, $\delta_n = (n/((\ln n)(\ln \ln n)))^{1/2}$


For estimating f at a given point x, the best possible design would be con- 
centrated in the point x, whence the problem would be a parametric one. There- 
fore, NP risk bounds for estimating f(x) should assume a given design, close 
to a uniform one in some sense. The result of Stone (1980) concerns the case 
where the x; are random. 


Theorem 1.3.2 (Stone, 1980). Suppose that $\{x_i\}$ are i.i.d. random variables
with values in $[0, 1]$, independent of $\{\varepsilon_i\}$, which have a density that is bounded
away from zero and infinity on $[0, 1]$. Let $x \in [0, 1]$, and let $l$ be a function as in
Theorem 1.3.1. For a given function class $\mathcal{F}$ and a sequence $\delta_n$ define

$$\tau = \liminf_n \inf_{\hat f} \sup_{f \in \mathcal{F}} E_f\, l(\delta_n |\hat f(x) - f(x)|).$$

If $\mathcal{F} = C(\beta, L)$, $\delta_n = n^{\beta/(2\beta+1)}$, then there is a constant $C > 0$ not depending
on $L$ such that $\tau \ge l(C)/2$.


We remark that this lower bound also holds if $\{\xi\}$ is the sequence of nonrandom
uniform designs.

Let us now turn to the question of how to attain these risk bounds. The
bounds of Theorem 1.3.1 admit experimental planning; the optimal estimators
given by IKh (1980b, 1982a) are based on experimental planning in the form
of a replicated observation scheme. But it is also of interest to show the attainment
of the optimal rate for more irregular design sequences given in advance;
we will come to this later. Attainment can be shown for smoothness classes which
are wider than $C(\beta, L)$, involving generalized derivatives. Define the Sobolev
class $W(\beta, p, L)$ for $\beta \in \mathbb{N}$, $1 \le p \le \infty$ by
$W(\beta, p, L) = \{f \mid f^{(\beta-1)} \text{ exists, is absolutely continuous},\ \|f\|_p^p + \|f^{(\beta)}\|_p^p \le L\}$.

Observe that each class $W(\beta, p, L)$ contains a class $C(\beta, L)$ with the same $\beta$
but possibly different $L$. The design scheme is of the following form. Let $r \in \mathbb{N}$,
even, and $x_i = (i - 1)/(2r + 1)$, $i = 1, \ldots, 2r + 1$. Observations are

$$\bar y_i = f(x_i) + m^{-1/2} \varepsilon_i, \quad i = 1, \ldots, 2r + 1, \quad m \in \mathbb{N}, \tag{12}$$

obtained as averages from $m$ replications. Then $n = (2r + 1) m$ is the total
number of observations. Consider an estimator

$$\hat f_{m,r}(x) = (2r + 1)^{-1} \sum_{i=1}^{2r+1} \bar y_i V_r(x - x_i),$$

where $V_r$ is the trigonometric Vallée-Poussin kernel:
$V_r(t/2\pi) = r^{-1} (\cos((r/2 + 1) t) - \cos((r + 1) t)) / \sin^2(t/2)$.
$\{V_r\}$ is a sequence of functions (trigonometric polynomials) tending to the delta
function at zero. This estimator is based on approximation theory by means of
trigonometric polynomials, and hence will be good when the regression function
satisfies periodic boundary conditions. Consider the periodic Sobolev classes:

$$\tilde W(\beta, p, L) = \{f \in W(\beta, p, L) \mid f^{(k)}(0) = f^{(k)}(1),\ k = 0, \ldots, \beta - 1\}. \tag{13}$$

Note that $\tilde W(\beta, p, L) \supset C_0(\beta, L')$ for some $L' > 0$.


Theorem 1.3.3 (IKh 1980b, 1982a). Suppose that observations are given by
the model (12), and that $l: \mathbb{R}_+ \to \mathbb{R}_+$ is a measurable function such that
$l(x) \le C \exp(x^2)$. In each of the cases (a)-(d) of Theorem 1.3.1, the numbers $m$ and
$r$ can be chosen such that

$$\inf_{C > 0} \sup_n \sup_{f \in \mathcal{F}} E_f\, l(C \delta_n \|\hat f_{m,r} - f\|_p) < \infty. \tag{14}$$

If the additional condition $\beta > 1/p$ is fulfilled, then this statement is also true
with $C_0(\beta, L)$ replaced by $\tilde W(\beta, p, L)$.


Even wider function classes (Hölder classes in $L_p$) can be admitted in these
results, allowing also a non-integer degree of smoothness $\beta$.

We see that these results solve the asymptotic efficiency problem, at the
rate of convergence level, for the cases considered. (For the estimation of $f$
at a point, attainment of the risk bound of Theorem 1.3.2 was established by
Stone (1980)). Nevertheless, it is still desirable to remove the assumption of
special experimental planning with Theorem 1.3.3, and also the assumption
of periodic boundary conditions fulfilled by the regression function. Various
results in this respect are available in the literature, mostly based on families of
linear smoothing operators $\{S_{n,r},\ n \in \mathbb{N},\ r > 0\}$. Among these, all well-known
NP estimation techniques are represented, such as kernel methods, piecewise
polynomial and spline smoothing. We shall at first note a result of Stone (1982)
concerning weighted least squares estimation by piecewise polynomials.

The smoothing operators $S_{n,r}$ are constructed as follows. Let $r \in \mathbb{N}$ and
$\mathcal{P}_{m,r}$ be the class of functions which are polynomials of degree at most $m - 1$
on each interval $[(j - 1)/r, j/r)$, $j = 1, \ldots, r$. Note that functions in $\mathcal{P}_{m,r}$ can
have jumps. A least squares fit to the data by such functions is considered. For
the asymptotics, one should require that $n$ is the effective observation number
also locally, everywhere on the interval. This can be formalized as follows.
For natural $d$, let $\{Q_i,\ i = 1, \ldots, d\}$ be a partition of $[0, 1]$ into nonintersecting
intervals of length $d^{-1}$, and set

$$\nu(d) = \sup_{1 \le i \le d} \frac{\mu(Q_i)}{|Q_i \cap \xi|\, n^{-1}},$$

where $\mu(Q_i) = d^{-1}$ is the length of $Q_i$ and $|\cdot|$ is the cardinality of a set. We
assume the existence of a sequence $\{d\} = \{d_n\}$ such that

$$r = o(d), \quad \nu(d) = O(1) \quad \text{as } n, r \to \infty. \tag{15}$$


This condition does not exclude considerable local deviations of the design
from the uniform grid with step $n^{-1}$. Thus, in the least squares criterion,
weights $u_j$ should be used:

$$u_j = \mu(Q_i)/|Q_i \cap \xi| \quad \text{for } x_j \in Q_i, \quad j = 1, \ldots, n.$$

Consider the minimization problem

$$\min \left\{ \sum_{j=1}^{n} u_j (y_j - g(x_j))^2 \;\middle|\; g \in \mathcal{P}_{m,r} \right\}. \tag{16}$$

If condition (15) is fulfilled, then this problem has a unique solution $\hat f_{m,r}$ for
all sufficiently large $n$.
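
A weighted piecewise polynomial fit of this type can be sketched as follows. Since functions in the class may jump at the points $j/r$, the minimization separates into $r$ independent weighted polynomial fits. The sketch assumes weights equal to the cell length $d^{-1}$ divided by the number of design points in the cell, and that every interval $[(j - 1)/r, j/r)$ contains at least $m$ design points; all function names are hypothetical.

```python
import numpy as np


def design_weights(x, d):
    """u_j = (length of Q_i) / (number of design points in Q_i), x_j in Q_i."""
    cell = np.minimum((x * d).astype(int), d - 1)
    counts = np.bincount(cell, minlength=d)
    return (1.0 / d) / counts[cell]


def piecewise_poly_fit(x, y, m, r, d):
    """Weighted least squares fit by polynomials of degree <= m - 1
    on each interval [(j - 1)/r, j/r); returns a prediction function."""
    u = design_weights(x, d)
    piece = np.minimum((x * r).astype(int), r - 1)
    coefs = []
    for j in range(r):
        sel = piece == j
        # np.polyfit minimizes sum (w_i * residual_i)^2, hence w = sqrt(u).
        coefs.append(np.polyfit(x[sel], y[sel], m - 1, w=np.sqrt(u[sel])))

    def predict(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        idx = np.minimum((t * r).astype(int), r - 1)
        return np.array([np.polyval(coefs[k], s) for k, s in zip(idx, t)])

    return predict


# Usage with simulated data (fixed seed for reproducibility).
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 300))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(300)
fit = piecewise_poly_fit(x, y, m=3, r=4, d=12)
print(fit([0.1, 0.5, 0.9]))
```

The resulting estimate is in general discontinuous at the interval boundaries, which is the drawback that motivates the spline estimators discussed later.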


Theorem 1.3.4 Suppose that observations are given by the model (1), and that $l$
is a function as in Theorem 1.3.3. If $\hat f_{m,r}$ is the minimizer of (16), then relation
(14) is valid in each of the cases:
(a) $\mathcal{F} = W(\beta, p, L)$, $1 \le p < \infty$, $\beta > 1/p$, $\delta_n = n^{\beta/(2\beta+1)}$, $m > \beta - 1$
($m$ fixed), $r \asymp \delta_n^{1/\beta}$, $\nu(d) = O(1)$, $d/r \to \infty$.
(b) $p = \infty$, $\delta_n = (n/\ln n)^{\beta/(2\beta+1)}$, all else as in (a).


This result (a slight modification of that of Stone, 1982) provides, for finite
smoothness $\beta$, the same properties of the estimators as Theorem 1.3.3. Moreover,
only the minimal condition (15) is required for the design, and no boundary
conditions are imposed on $f$.

The piecewise polynomial estimator of Theorem 1.3.4 has the drawback
that it is not smooth. Introducing additional smoothness conditions on the
functions in $\mathcal{P}_{m,r}$ would lead to spline estimators. Before discussing these,
however, let us briefly review the kernel method, following Gasser and Müller
(1979). Consider a function $K: \mathbb{R} \to \mathbb{R}$ with the properties

$$\int K(x)\, dx = 1, \qquad \int K(x) x^j\, dx = 0, \quad j = 1, \ldots, \beta - 1. \tag{17}$$


Kernel estimation is based on the approximation properties of the family of
convolution operators

$$(K_r * f)(x) = \int_0^1 K_r(x - t) f(t)\, dt, \quad K_r(x) = r K(r x), \quad r > 0.$$

There are various possibilities for approximating the convolution integral
from the data, leading to different kernel estimators. The variant

$$\hat f(x) = \sum_{j=1}^{n} K_r(x - x_j) y_j \Big/ \sum_{j=1}^{n} K_r(x - x_j)$$

requires a stronger uniformity condition on $\xi$ to be rate-optimal. In the literature,
this is extensively treated for the case of random i.i.d. design points
$x_j$ (see e.g. Collomb, 1981). Another possibility is

$$\hat f(x) = \sum_{j=1}^{n} K_r(x - x_j) y_j u_j; \tag{18}$$

this estimator is similar to one of Priestley and Chao (1972). Gasser and Müller
(1979) proposed another method, as follows. Let $y^*(x)$ be the random function
on $[0, 1]$ which, for $x \in Q_i$, equals the average of those $y_j$ for which $x_j \in Q_i$. Let

$$\hat f(x) = \int K_r(x - t) y^*(t)\, dt. \tag{19}$$

In general, for regression on $[0, 1]$, a kernel $K$ with compact support (say
$[-1, 1]$) has to be used. Moreover, in the boundary regions $[1 - r^{-1}, 1]$ and
$[0, r^{-1}]$, one should use one-sided kernels to overcome edge effects, i.e., kernels
with support $[0, 1]$ or $[-1, 0]$, respectively, fulfilling (17). Gasser and Müller
(1979) established some optimal rate results for such estimators, based on
(18) or (19). We note a generalization as follows.


Theorem 1.3.5 Let $\hat f_{m,r}$ be the estimator (19), boundary modified as described,
where the three kernels used are square-integrable in addition. Then Theorem 1.3.4
holds with this meaning of $\hat f_{m,r}$.


Müller (1984) constructed appropriate kernels that are, in addition, smooth
(on $\mathbb{R}$), which allows the estimator to be made as smooth as $f$ itself. The latter
two authors also considered the constants appearing in the asymptotic risk,
and addressed optimality within classes of kernels. A further reference on
kernel estimators of type (19) is Cheng and Lin (1981). For an application of
the method of orthogonal series in the present context, we refer to Koryakin
(1983).
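
The weighted kernel estimator of type (18) can be sketched in a few lines. The sketch assumes the Epanechnikov kernel on $[-1, 1]$ (which integrates to one and has vanishing first moment), the scaling $K_r(t) = r K(r t)$, and weights $u_j = n^{-1}$ as appropriate for a uniform design; no boundary correction is included, so it is only meaningful for points at distance at least $r^{-1}$ from the ends of $[0, 1]$. The function names are hypothetical.

```python
import numpy as np


def epanechnikov(t):
    """Kernel supported on [-1, 1] with integral 1 and zero first moment."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)


def kernel_estimate(x0, x, y, r, u=None):
    """f_hat(x0) = sum_j K_r(x0 - x_j) * y_j * u_j with K_r(t) = r * K(r * t)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if u is None:
        # Weights for a uniform design; an assumption of this sketch.
        u = np.full(len(x), 1.0 / len(x))
    return float(np.sum(r * epanechnikov(r * (x0 - x)) * y * u))


# Usage with simulated data (fixed seed for reproducibility).
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 500)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(500)
print(kernel_estimate(0.25, x, y, r=8))
```

Here $r^{-1}$ plays the role of the bandwidth, so a larger `r` averages over fewer observations.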


1.3.3 Spline smoothing


Splines are a sophisticated tool of approximation theory; their connection 
with the finite element method in numerical analysis has stimulated their 
development. Some of the procedures, such as the Reinsch-Schoenberg smooth- 
ing spline, have been specifically proposed for the case of noisy data, i.e., for 
NP regression. We shall attempt to survey what is known with regard to 
statistical optimality. 


Those elements of the space of piecewise polynomials $\mathcal{P}_{m,r}$ which have some
continuity or smoothness properties are splines (with equidistant knots). A
least squares problem similar to (16) for the class of splines $\mathcal{S}_{m,r} = \mathcal{P}_{m,r} \cap C^{m-2}$
was considered by Agarwal and Studden (1980). A convenient basis of $\mathcal{S}_{m,r}$
is formed by the B-splines. Let $B_0$ be the $m$-fold convolution power of the indicator
of $[0, r^{-1}]$, and let $B_j(x) = B_0(x - j r^{-1})$, $x \in \mathbb{R}$, for each integer $j$. Each $B_j$
is a B-spline with support $[j r^{-1}, (j + m) r^{-1}]$; the set of restrictions to $[0, 1]$
of the functions $B_j$, $j = -m + 1, \ldots, r - 1$, spans $\mathcal{S}_{m,r}$. Consider the minimization
problem

$$\min \left\{ \sum_{j=1}^{n} (y_j - g(x_j))^2 \;\middle|\; g \in \mathcal{S}_{m,r} \right\}. \tag{20}$$

Theorem 1.3.6 Let $\hat f_{m,r}$ be the minimizer of (20). Then Theorem 1.3.4 holds
with this meaning of $\hat f_{m,r}$.


We see that the least squares spline estimator attains the optimal rate in
$L_p$, $1 \le p \le \infty$. Agarwal and Studden (1980) proved this for $p = 2$. These
authors also analysed the constant appearing in the asymptotic risk and proposed
some optimization within a class of spline estimators.

Note that $m$ can be chosen equal to $\beta + 2$, whence the estimator is in $C^\beta$,
i.e., satisfies the smoothness assumption made on the regression function.

Let us now turn to the classical smoothing spline. Consider the Sobolev space
$W_2^m = \{f \mid f^{(m-1)} \text{ exists, is absolutely continuous},\ f^{(m)} \in L_2\}$ and the minimization
problem, for $r > 0$,

$$\min \left\{ n^{-1} \sum_{j=1}^{n} (y_j - g(x_j))^2 + r^{-2m} \|g^{(m)}\|_2^2 \;\middle|\; g \in W_2^m \right\}. \tag{21}$$

The solution $\hat f_{m,r}$ is unique if $\xi$ contains at least $m + 1$ different points, and
$\hat f_{m,r}$ is a spline which is a linear function of the data. This estimator is thus of
the linear smoothing type, with a particularly appealing heuristic motivation.
In the risk evaluation, the bias part can be treated very simply as follows.
Note that

$$f^0_{m,r}(x) = E_f \hat f_{m,r}(x), \quad x \in [0, 1],$$

is the smoothing spline for data $f^\xi$. Then, if $f \in W(m, 2, L)$,

$$n^{-1} \sum_{j=1}^{n} (f(x_j) - f^0_{m,r}(x_j))^2 \le n^{-1} \sum_{j=1}^{n} (f(x_j) - f^0_{m,r}(x_j))^2 + r^{-2m} \|(f^0_{m,r})^{(m)}\|_2^2 \le r^{-2m} \|f^{(m)}\|_2^2 \le r^{-2m} L,$$

since $f^0_{m,r}$ solves (21) for $y = f^\xi$. The left-hand side of this chain is an approximation
to the bias part of the risk $\|f - f^0_{m,r}\|_2^2$. The difficulty now remains in
evaluating the variance part of the risk; this requires eigenvalue estimates for
a certain approximation to a differential operator in $L_2$. Results for the case
of uniform design were given by Wahba (1978), Craven and Wahba (1979), and
Utreras (1980, 1983). These were generalized by Cox (1983, 1984) to a certain
class of nonuniform designs. Let $\Phi_n$ be the distribution function which assigns
the mass $n^{-1}$ to each point of $\xi$. Define

$$d_n = \sup_{x \in [0,1]} |\Phi_n(x) - x|;$$

$d_n$ is the Kolmogorov distance of $\Phi_n$ from the uniform distribution on $[0, 1]$.
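
The distance $d_n$ is easy to compute for a given design, because the supremum of $|\Phi_n(x) - x|$ is approached at a design point, either from the right or from the left. The following sketch, with a hypothetical function name, uses this fact; for the uniform design $\{(i - 1)/(n - 1)\}$ it returns $1/n$.

```python
import numpy as np


def kolmogorov_distance_to_uniform(design):
    """sup over [0, 1] of |Phi_n(x) - x| for the empirical distribution
    function Phi_n that puts mass 1/n on each design point."""
    x = np.sort(np.asarray(design, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    right = np.max(np.abs(i / n - x))        # value of Phi_n at x_i
    left = np.max(np.abs((i - 1) / n - x))   # left limit of Phi_n at x_i
    return float(max(right, left))


n = 100
print(kolmogorov_distance_to_uniform(np.linspace(0.0, 1.0, n)))  # equals 1/n
```

A design clustered in part of the interval yields a much larger value, which is the kind of nonuniformity that a condition on $d_n$ restricts.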


Theorem 1.3.7 (Cox, 1984) Let $\mathcal{F} = W(\beta, 2, L)$ and $\hat f_{m,r}$ be the minimizer
of (21). The relation

$$\sup_{f \in \mathcal{F}} E_f \|\hat f_{m,r} - f\|_2^2 = O(n^{-2\beta/(2\beta+1)})$$

holds if $m = \max(\beta, 2)$, $r \asymp n^{1/(2\beta+1)}$, and if the condition

$$d_n = o(n^{-5/(2(2\beta+1))}) \tag{22}$$

is fulfilled.


Note that, for the uniform design sequence, we have $d_n = O(n^{-1})$, whence
(22) is met for all natural $\beta$. The paper of Cox (1984) in fact treats the multivariate
case ($f: \mathcal{X} \to \mathbb{R}$, $\mathcal{X} \subset \mathbb{R}^k$, see point 1.3.5 (a) below). There, also limits
for $\Phi_n$ that are not uniform are admitted, namely distributions with a density
which is bounded away from 0 and $\infty$ on $[0, 1]$. Ragozin (1983) obtained
related results on the convergence rates for the estimation of the derivative of
$f$ via the smoothing spline.

Rice and Rosenblatt (1981, 1983) also investigated the asymptotic behaviour
of smoothing splines, with special emphasis on boundary effects. In the context
of Theorem 1.3.7 their result means that if $\beta > m$, then $\hat f_{m,r}$ can attain the
optimal rate only if $f$ satisfies some additional boundary conditions. This was
further elaborated by Cox (1984).

The smoothing spline also has an optimality property for estimating $f$ at a
point. For any $x \in [0, 1]$ and parameter space $W(\beta, 2, L)$, it is minimax among
linear estimators of $f(x)$; cf. Zi (1984). For the asymptotics, the following can
be conjectured. Suppose $\{\xi\}$ is the sequence of uniform designs; the sequence
$\delta_n$ for a result like Theorem 1.3.2 is $\delta_n = n^{\kappa/(2\kappa+1)}$, $\kappa = \beta - 1/2$. (For density
estimation such a result is known; cf. Wahba, 1975). The smoothing spline
f o,. attains this rate (for U(x) = x?) for a choice r ~ n. 


1.3.4 Optimal rates and exact constants 


The optimality results for NP regression discussed so far are rather weak compared
with what is known in the parametric case. Consider the minimax risk
as given by (2), for $p = 2$. In the $k$-dimensional parametric case, one has results
of the type

$$\lim_n n \Delta_n(\mathcal{F}) = k \tag{23}$$

(see (6)), while the optimal rate statements in the NP case ensure the existence
of positive constants $C_1$, $C_2$ such that

$$C_1(1 + o(1)) \le n^{2\beta/(2\beta+1)} \Delta_n(\mathcal{F}) \le C_2(1 + o(1)). \tag{24}$$


Now, certainly, it would be desirable to specify constants $C_1 = C_2$ and thus
to obtain an analogue of (23), i.e., of Fisher’s bound for asymptotic variances
(or of the Hajek AM theorem). The methods of risk evaluation mentioned up 
to now do not provide this information; especially with regard to C, they 
are of qualitative nature. While in the general $L_p$ case it appears difficult to
go further, the Hilbertian structure of $L_2$ allows some improvement. A significant
result in this respect is due to Pinsker (1980). It concerns the estimation
of a function continuously observed in Gaussian white noise. Consider observations
given by a stochastic differential

$$dy(t) = f(t)\, dt + n^{-1/2}\, dW(t), \quad t \in [0, 1], \tag{25}$$


where $dW(t)$ is the derivative of the standard Wiener process, and $f$ is a
function from a parameter set $\mathcal{F} \subset L_2$. The problem of NP estimation of $f$ in this
model, with a small noise asymptotic $n \to \infty$, has many traits in common
with NP density or regression estimation; for optimal rates of convergence
see IKh (1981, Chapter 7; 1980a). Let $\{\varphi_j\}_{j \in \mathbb{Z}}$ be an orthonormal basis of $L_2$,
and let $f_j = (f, \varphi_j)$ be the Fourier coefficients of $f$. Assume that $\mathcal{F}$ is an ellipsoid:

$$\mathcal{F} = \Big\{ f : \sum_{j \in \mathbb{Z}} a_j f_j^2 \le L \Big\},$$

where $a_j$, $j \in \mathbb{Z}$, are coefficients such that $a_j \to \infty$ for $j \to \pm\infty$. It is known
that the periodic Sobolev class $\tilde W(\beta, 2, L)$ (see (13)) can be described as an
ellipsoid in terms of the classical Fourier basis of $L_2$; then $a_j = 1 + (2\pi j)^{2\beta}$,
$j \in \mathbb{Z}$. Consider the minimax risk for a squared $L_2$-loss:

$$\Delta_n(\mathcal{F}) = \inf_{\hat f} \sup_{f \in \mathcal{F}} E_f \|\hat f - f\|_2^2.$$

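Theorem 1.3.8 (Pinsker, 1980). Suppose that observations are given by the
model (25). Let $\mathcal{F} = \tilde W(\beta, 2, L)$. Then

$$\lim_{n} n^{2\beta/(2\beta+1)}\,\Delta_n(\mathcal{F}) = \gamma(\beta, L),$$

where

$$\gamma(\beta, L) = \big(L(2\beta+1)\big)^{1/(2\beta+1)} \Big(\frac{\beta}{\pi(\beta+1)}\Big)^{2\beta/(2\beta+1)}.$$

This result indeed specifies $C_1 = C_2 = \gamma(\beta, L)$ in (24); it hence provides
the exact asymptotic minimax constant. To outline the basic idea, let (25)
be decomposed into Fourier coefficients:

$$\eta_j = \int_0^1 \varphi_j\,dy = f_j + n^{-1/2}\xi_j, \quad j \in \mathbb{Z}, \tag{26}$$

where $\xi_j$, $j \in \mathbb{Z}$, are i.i.d. standard normal. Estimating $f$ is equivalent to
estimating $f_j$, $j \in \mathbb{Z}$. Consider linear estimators of $f_j$:

$$\hat f_j = c_j \eta_j, \quad j \in \mathbb{Z},$$

where $c_j$ are fixed coefficients. Let $\sigma = \{\sigma_j^2\}_{j \in \mathbb{Z}}$ be some sequence of nonnegative
numbers, $c = \{c_j\}_{j \in \mathbb{Z}}$, $\sum_{j \in \mathbb{Z}} c_j^2 < \infty$, and define

$$T(c, \sigma) = \sum_{j \in \mathbb{Z}} \big((1 - c_j)^2 \sigma_j^2 + n^{-1} c_j^2\big).$$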


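The expression $T(c, \sigma)$ can be interpreted in two ways:

(a) As the risk $E_f\|\hat f - f\|_2^2$ of a linear estimator $\hat f$ based on $c$ if $f_j^2 = \sigma_j^2$, $j \in \mathbb{Z}$.
Indeed,

$$E_f\|\hat f - f\|_2^2 = \sum_{j \in \mathbb{Z}} E_f(\hat f_j - f_j)^2 = \sum_{j \in \mathbb{Z}} \big((1 - c_j)^2 f_j^2 + n^{-1} c_j^2\big).$$

(b) As the mixed risk $\int E_f\|\hat f - f\|_2^2\, d\pi_\sigma(f)$ of this linear estimator if the prior
$\pi_\sigma$ sets the $f_j$ independent $N(0, \sigma_j^2)$.

Let $\mathcal{F}^* = \{\sigma : \sum_{j \in \mathbb{Z}} a_j \sigma_j^2 \le L\}$ and observe that

$$\inf_{c} \sup_{\sigma \in \mathcal{F}^*} T(c, \sigma) = \sup_{\sigma \in \mathcal{F}^*} \inf_{c} T(c, \sigma). \tag{27}$$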


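Indeed, the conditions of the von Neumann minimax theorem (see, e.g.,
Balakrishnan, 1976) are fulfilled; in particular, $T(c, \sigma)$ is convex in $c$, concave
(linear) in $\sigma$, and both domains are convex. A saddle point $(c^*, \sigma^*)$ is

$$c_j^* = (1 - \lambda a_j^{1/2})_+, \quad \sigma_j^{*2} = n^{-1}(\lambda^{-1} a_j^{-1/2} - 1)_+, \tag{28}$$

where $\lambda$ is a solution of

$$\sum_{j \in \mathbb{Z}} a_j\, n^{-1} (\lambda^{-1} a_j^{-1/2} - 1)_+ = L. \tag{29}$$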


Since $T(c^*, \sigma^*)$ is the minimax risk among linear estimators, we have
$T(c^*, \sigma^*) \ge \Delta_n(\mathcal{F})$. But $T(c^*, \sigma^*)$ is also the Bayes risk for a prior $\pi_{\sigma^*}$; indeed, since
$\pi_{\sigma^*}$ is Gaussian, the Bayes estimator is linear, and the right-hand side of (27)
is a Bayes risk. The prior $\pi_{\sigma^*}$ is not concentrated on $\mathcal{F}$; if it were, then one
could conclude $\Delta_n(\mathcal{F}) \ge T(c^*, \sigma^*)$. But it can be shown that $\pi_{\sigma^*}$ concentrates
on $\mathcal{F}$ in some sense as $n \to \infty$, whence

$$\Delta_n(\mathcal{F}) = T(c^*, \sigma^*)\,(1 + o(1)), \quad n \to \infty.$$

The asymptotics of $T(c^*, \sigma^*)$ can easily be calculated for the ellipsoid $\tilde W(\beta, 2, L)$,
yielding the constant $\gamma(\beta, L)$. An efficient estimator in this sense is given by
the linear smoother $c^*$ of (28). Note that (29) implies $\lambda \to 0$ for $n \to \infty$, so
that this estimator conforms to the usual linear filtering scheme where the
number of nonzero coefficients in the filter increases with $n$.
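
The saddle point (28), (29) can also be evaluated numerically. The following is a
minimal sketch and not part of the original argument. It assumes the ellipsoid
coefficients $a_j = 1 + (2\pi j)^{2\beta}$ of the periodic Sobolev class and an ellipsoid
bound $L$, truncates the index set to $|j| \le m$ (an assumption that is harmless once
$m$ exceeds the number of active coefficients), and solves (29) for $\lambda$ by bisection.
The function names are hypothetical, and the closed form used for $\gamma(\beta, L)$ is the
standard expression for Pinsker's constant, stated here as an assumption.

```python
import math


def pinsker_saddle_point(n, beta, L, m=2000, iters=100):
    """Solve (29) for lambda by bisection and return the quantities in (28).

    Assumptions of this sketch: a_j = 1 + (2*pi*j)**(2*beta), indices
    truncated to |j| <= m, ellipsoid sum_j a_j * f_j**2 <= L.
    """
    sqrt_a = [
        math.sqrt(1.0 + (2.0 * math.pi * abs(j)) ** (2 * beta))
        for j in range(-m, m + 1)
    ]

    def constraint(lam):
        # Left-hand side of (29): sum_j a_j * n^{-1} * (1/(lam*sqrt(a_j)) - 1)_+
        return sum(s * s * max(0.0, 1.0 / (lam * s) - 1.0) for s in sqrt_a) / n

    # constraint(lam) decreases in lam and vanishes at lam = 1 (since a_j >= 1).
    lo, hi = 1.0 / max(sqrt_a), 1.0
    if constraint(lo) < L:
        raise ValueError("truncation level m too small for this n and L")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisection on the log scale
        if constraint(mid) > L:
            lo = mid
        else:
            hi = mid
    lam = math.sqrt(lo * hi)

    c = [max(0.0, 1.0 - lam * s) for s in sqrt_a]                   # filter c*
    sigma2 = [max(0.0, 1.0 / (lam * s) - 1.0) / n for s in sqrt_a]  # sigma*^2
    risk = sum((1.0 - cj) ** 2 * s2 + cj ** 2 / n for cj, s2 in zip(c, sigma2))
    return lam, c, sigma2, risk


def pinsker_constant(beta, L):
    """Closed form assumed for gamma(beta, L) on the Sobolev ellipsoid."""
    return (L * (2 * beta + 1)) ** (1.0 / (2 * beta + 1)) * (
        beta / (math.pi * (beta + 1))
    ) ** (2.0 * beta / (2 * beta + 1))


if __name__ == "__main__":
    n, beta, L = 10 ** 8, 2, 1.0
    lam, c, sigma2, risk = pinsker_saddle_point(n, beta, L)
    print("lambda:", lam)
    print("nonzero filter coefficients:", sum(cj > 0 for cj in c))
    print("normalized linear minimax risk:", n ** (2.0 * beta / (2 * beta + 1)) * risk)
    print("gamma(beta, L):", pinsker_constant(beta, L))
```

At the saddle point one has $T(c^*, \sigma^*) = n^{-1} \sum_j c_j^*$, which gives a quick consistency
check on the output; the normalized risk should approach $\gamma(\beta, L)$ as $n$ grows.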

A variant of this method leading to optimal kernel estimators was developed
by Golubev (1982, 1987), using the Fourier transform of functions $g \in L_2(\mathbb{R})$
as a tool:

$$\hat g(t) = \int_{\mathbb{R}} \exp(2\pi i t x)\, g(x)\, dx.$$

Suppose that, in (25), the function $f$ satisfies some different boundary conditions
on $[0, 1]$: $f \in W^0(\beta, 2, L)$, where

$$W^0(\beta, 2, L) = \{ f \in W(\beta, 2, L) : f^{(k)}(0) = f^{(k)}(1) = 0,\ k = 0, \dots, \beta - 1 \}.$$

This class can conveniently be characterized in terms of the Fourier transform:
if $f \in W^0(\beta, 2, L)$, then its zero extension to $\mathbb{R}$ ($f = 0$ on $[0, 1]^c$) satisfies

$$\int_{\mathbb{R}} |\hat f(t)|^2 (2\pi t)^{2\beta}\, dt \le L. \tag{30}$$

Consider kernel estimators of $f$ in (25):

$$\hat f_r(x) = \int_0^1 K_r(x - t)\, dy(t), \quad K_r(x) = rK(rx), \quad r > 0. \tag{31}$$


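Suppose $K \in L_2(\mathbb{R})$. The risk is

$$E_f \|\hat f_r - f\|_2^2 \le \int_{\mathbb{R}} \big(K_r * f(t) - f(t)\big)^2\, dt + n^{-1} \int_{\mathbb{R}} K_r^2(t)\, dt
= \int_{\mathbb{R}} |1 - \hat K_r(t)|^2\, |\hat f(t)|^2\, dt + n^{-1} \int_{\mathbb{R}} |\hat K_r(t)|^2\, dt.$$

Using (30), the fact that $\hat K_r(t) = \hat K(r^{-1}t)$, and a change of variables, we obtain

$$E_f \|\hat f_r - f\|_2^2 \le L \operatorname{ess\,sup}_t |1 - \hat K(t)|^2 (2\pi t)^{-2\beta}\, r^{-2\beta} + n^{-1} r \int_{\mathbb{R}} |\hat K(t)|^2\, dt.$$

Put $r = n^{1/(2\beta+1)}$; then

$$E_f \|\hat f_r - f\|_2^2\; n^{2\beta/(2\beta+1)} \le L \operatorname{ess\,sup}_t |1 - \hat K(t)|^2 (2\pi t)^{-2\beta} + \int_{\mathbb{R}} |\hat K(t)|^2\, dt.$$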


Denote the functional of $K$ which is on the right-hand side by $U_{\beta,L}(K)$. Observe
that $U_{\beta,L}(K) < \infty$ implies that the kernel $K$ satisfies the usual conditions (17)
(for $m = \beta$). The kernel $K^*$, which minimizes $U_{\beta,L}(K)$, is given by

$$\hat K^*(t) = (1 - \lambda |t|^{\beta})_+, \tag{32}$$

where $\lambda$ solves

$$\int_{\mathbb{R}} \big(\lambda^{-1} |t|^{\beta} - |t|^{2\beta}\big)_+\, dt = L(2\pi)^{-2\beta}.$$

It turns out that

$$\gamma(\beta, L) = \inf_{K} U_{\beta,L}(K) = U_{\beta,L}(K^*).$$


Theorem 1.3.9 (Golubev, 1982). Suppose that observations are given by the
model (25). Let $\mathcal{F} = W^0(\beta, 2, L)$. Then $\Delta_n(\mathcal{F})$ satisfies the same relation as in
Theorem 1.3.8. This risk bound is attained by the estimator (31) with $r = n^{1/(2\beta+1)}$
and kernel $K^*$ given by (32).
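
The optimal kernel of (32) can be computed explicitly, since the equation for $\lambda$
involves only a polynomial integrand on $|t| \le \lambda^{-1/\beta}$. The sketch below is an
illustration under stated assumptions, not Golubev's construction itself: it takes
$\hat K^*(t) = (1 - \lambda|t|^\beta)_+$, obtains $\lambda$ in closed form, evaluates the functional
$L\,\mathrm{ess\,sup}_t |1 - \hat K(t)|^2 (2\pi t)^{-2\beta} + \int |\hat K(t)|^2 dt$ by the midpoint rule, and compares
it with the standard closed form of Pinsker's constant (an assumption of the sketch).
All function names are hypothetical.

```python
import math


def golubev_kernel_ft(beta, L):
    """Return (lam, cutoff, khat) with khat(t) = (1 - lam*|t|**beta)_+.

    lam solves int (|t|**beta / lam - |t|**(2*beta))_+ dt = L*(2*pi)**(-2*beta);
    integrating the polynomial over |t| <= cutoff = lam**(-1/beta) gives
    2 * cutoff**(2*beta+1) * beta / ((beta+1)*(2*beta+1)) = L*(2*pi)**(-2*beta).
    """
    cutoff = (
        L * (beta + 1) * (2 * beta + 1)
        / (2.0 * beta * (2.0 * math.pi) ** (2 * beta))
    ) ** (1.0 / (2 * beta + 1))
    lam = cutoff ** (-beta)

    def khat(t):
        return max(0.0, 1.0 - lam * abs(t) ** beta)

    return lam, cutoff, khat


def functional_u(beta, L, steps=20000):
    """Evaluate the risk functional at the kernel returned above.

    For this kernel the essential supremum equals lam**2 * (2*pi)**(-2*beta);
    the integral of khat**2 is approximated by the midpoint rule.
    """
    lam, cutoff, khat = golubev_kernel_ft(beta, L)
    sup_term = L * lam ** 2 * (2.0 * math.pi) ** (-2 * beta)
    h = 2.0 * cutoff / steps
    integral = h * sum(khat(-cutoff + (i + 0.5) * h) ** 2 for i in range(steps))
    return sup_term + integral


def pinsker_constant(beta, L):
    """Closed form assumed for gamma(beta, L)."""
    return (L * (2 * beta + 1)) ** (1.0 / (2 * beta + 1)) * (
        beta / (math.pi * (beta + 1))
    ) ** (2.0 * beta / (2 * beta + 1))


if __name__ == "__main__":
    for beta in (1, 2, 3):
        print(beta, functional_u(beta, 1.0), pinsker_constant(beta, 1.0))
```

The two printed columns should agree up to quadrature error, which mirrors the
identity $\gamma(\beta, L) = U_{\beta,L}(K^*)$ stated before the theorem.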


This is a rather strong result on optimal kernel estimation in $L_2$, a problem
which has some history in the literature (see Watson and Leadbetter, 1963;
Davis, 1977; Gasser and Müller, 1979). Note that the optimal kernel $K^*$ has
support $\mathbb{R}$, since its Fourier transform has compact support.

We mention that similar results on the asymptotic minimax constant have
also been obtained in models of NP spectral and probability density estimation
(IKh, 1982b; Pinsker and Yefroimovich, 1981, 1982).

Consider now the NP regression model with discrete observations (1). To
describe the exact asymptotics of $\Delta_n(\mathcal{F})$ in this model, we employ a spline
approach. In the preceding results on the AM constant, boundary conditions
were imposed on the function $f$. It is of interest to dispense with these in both
the discrete and continuous observation cases. Thus we assume that
$\mathcal{F} = W(\beta, 2, L)$, the Sobolev class without boundary conditions. It will also
be supposed that $\xi$ is the uniform design of size $n$. First we ask what kind of
restriction the smoothness assumption $f \in W(\beta, 2, L)$ implies for the function
values $f^\xi$. An answer to this is provided by spline theory. Consider the
minimization problem

$$\min \{ \|g^{(\beta)}\|_2^2 : g \in W_2^\beta,\ g^\xi = f^\xi \}.$$

The solution exists and is unique for $n \ge \beta$, and is the natural polynomial
interpolation spline, denoted by $\sigma(f^\xi)$. Then

$$\|\sigma(f^\xi)^{(\beta)}\|_2^2 \le \|f^{(\beta)}\|_2^2 \le L.$$

Now $\sigma(f^\xi)$ is known to be linear in $f^\xi$. Hence, $\|\sigma(f^\xi)^{(\beta)}\|_2^2$ is a quadratic form, with
matrix $\Gamma_n$, say, and we have

$$(f^\xi)'\, \Gamma_n f^\xi \le L.$$


Thus we know that the parameter space for the vector $f^\xi$ is contained in an
ellipsoid in $\mathbb{R}^n$. (It would coincide with this ellipsoid if $\mathcal{F}$ were defined by
$\|f^{(\beta)}\|_2^2 \le L$ rather than by $\|f\|_2^2 + \|f^{(\beta)}\|_2^2 \le L$. This difference is unessential
in the sequel.) Consider the loss

$$n^{-1} \sum_{j=1}^{n} \big(\hat f(x_j) - f(x_j)\big)^2. \tag{33}$$

Let $\Gamma_n = O \Lambda O'$ be a spectral decomposition of $\Gamma_n$, where $O$ is an orthogonal
$n \times n$ matrix and $\Lambda$ is diagonal with diagonal elements $\lambda_{11} \le \lambda_{22} \le \dots \le \lambda_{nn}$.
Let $g = n^{-1/2} O' f^\xi$ be a transformed parameter vector. For $g$ we have
observations of structure (26), though of finite dimension $n$. The loss (33) transforms to
$\sum_{j=1}^{n} (\hat g_j - g_j)^2$, and $g$ is contained in an ellipsoid in $\mathbb{R}^n$: $\sum_{j=1}^{n} a_j g_j^2 \le L$, where the
$a_j$ coincide with $n\lambda_{jj}$. The method of Pinsker (1980) can then be applied to
this observation scheme; it describes the exact asymptotics of $\Delta_n(\mathcal{F})$. The
concrete rate and constant then depend on the numbers $a_j$ and on $L$. The rate
is already known to be $n^{-2\beta/(2\beta+1)}$ (cf. Theorems 1.3.1, 1.3.4), and it turns out
that the constant is the same as in Theorem 1.3.8. The proof requires an
eigenvalue estimate for $n\Gamma_n$, viz. it is to be shown that the $a_j$ tend to behave like
$(\pi j)^{2\beta}$. The problem of spectral estimates for the matrix $\Gamma_n$ associated with
spline interpolation was treated earlier by Craven and Wahba (1979) and
Utreras (1980, 1983).

To describe the asymptotically optimal estimator, let $r = n^{1/(2\beta+1)}$ and
define numbers

$$\tilde c_j = 1, \quad 1 \le j \le r/\log n; \qquad \tilde c_j = \hat K^*(j/2r), \quad r/\log n < j \le n,$$

with $\hat K^*$ from (32). The $\tilde c_j$ represent a slightly modified version of the optimal
filter $c^*$ from (28); the number of nonzero $\tilde c_j$ is of order $r$. Define

$$\tilde C = (\tilde c_j \delta_{ij})_{i,j = 1, \dots, n} \quad (\delta_{ij}\ \text{the Kronecker symbol})$$

and let

$$\hat f^\xi = O \tilde C O' y$$

be an estimator of $f^\xi$. The estimator of $f$ will be the interpolating spline

$$\hat f = \sigma(\hat f^\xi). \tag{34}$$


Theorem 1.3.10 Suppose that observations are given by the model (1), and that
$\xi$ is the uniform design of size $n$. Let $\mathcal{F} = W(\beta, 2, L)$. Then $\Delta_n(\mathcal{F})$ satisfies the
same relation as in Theorem 1.3.8. This risk bound is attained by the estimator
(34).

For details see Nussbaum (1985). The optimal estimator (34) may be viewed
as a smoothing spline different from the classical one (21). It is known that the
latter corresponds to a filter

$$(1 + r(2\pi j)^{2\beta})^{-1}$$

(approximately; here $r$ is the number appearing in (21)). Hence it will not be
optimal in the sense of the AM constant, whatever the choice of $r$. The optimality
properties of (21) for estimating $f$ at a point were mentioned earlier.

A method to remove the boundary conditions on $f$ in Theorem 1.3.9 within
the framework of the kernel method was proposed by Golubev (1987).
It amounts to a boundary modification similar to the one used by Gasser and
Müller (1979) (cf. Theorem 1.3.5). In the boundary regions, any boundary
kernel fulfilling (17) is used. In the interior of the interval, a sequence of kernels
with compact support, fulfilling (17), which approximates the optimal kernel
$K^*$ is used. Since $K^*$ has noncompact support, it cannot itself be used in this
scheme because of the boundary effect. We also mention the paper of Golubev
(1984), which treats the exact asymptotics of $\Delta_n(\mathcal{F})$ for a regression model
with an expanding interval of observation.
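
The remark above, that a filter of the classical smoothing spline type cannot
attain the AM constant for any choice of its parameter, can be made quantitative
in a simplified setting. The sketch below is an illustration under assumptions that
are not in the original text: it works in the continuous frequency approximation,
uses the normalized worst-case risk $L \sup_t |1 - \hat K(t)|^2 (2\pi t)^{-2\beta} + \int |\hat K(t)|^2 dt$ for a filter
$\hat K_\rho(t) = (1 + \rho (2\pi t)^{2\beta})^{-1}$, minimizes over the hypothetical parameter $\rho$ in closed
form, and compares the result with the standard closed form of Pinsker's constant.

```python
import math


def pinsker_constant(beta, L):
    """Closed form assumed for gamma(beta, L) (Pinsker's constant)."""
    return (L * (2 * beta + 1)) ** (1.0 / (2 * beta + 1)) * (
        beta / (math.pi * (beta + 1))
    ) ** (2.0 * beta / (2 * beta + 1))


def best_spline_type_risk(beta, L):
    """Minimize the worst-case risk over rho for 1/(1 + rho*(2*pi*t)**(2*beta)).

    Bias term:     L * sup_t |1 - filter|^2 (2 pi t)^(-2 beta) = L * rho / 4,
                   attained where rho * (2 pi t)^(2 beta) = 1.
    Variance term: int_R filter(t)^2 dt = C * rho**(-1/(2*beta)),
                   with C = (1/(2*pi)) * int_R (1 + u**(2*beta))**(-2) du.
    """
    p = 2 * beta
    # int_R (1 + u^p)^(-2) du = 2 * (1 - 1/p) * (pi/p) / sin(pi/p)
    integral = 2.0 * (1.0 - 1.0 / p) * (math.pi / p) / math.sin(math.pi / p)
    C = integral / (2.0 * math.pi)
    rho = (2.0 * C / (beta * L)) ** (p / (p + 1.0))  # stationary point in rho
    risk = L * rho / 4.0 + C * rho ** (-1.0 / p)
    return rho, risk


if __name__ == "__main__":
    for beta in (1, 2, 3, 4):
        rho, risk = best_spline_type_risk(beta, 1.0)
        ratio = risk / pinsker_constant(beta, 1.0)
        print(f"beta={beta}: spline-type constant / gamma = {ratio:.4f}")
```

Under these assumptions the ratio exceeds one for every $\beta$ and does not depend on
$L$, since both constants scale like $L^{1/(2\beta+1)}$.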

Results like Theorem 1.3.10 can also be established for nonuniform designs,
e.g. for those fulfilling condition (22), but we shall not dwell on this. Instead,
for some more basic insight, we note some results of Sacks and Strawderman
(1982) concerning the estimation of $f$ at a point $x$. These authors show that,
for some smoothness classes $\mathcal{F}$ and quadratic loss, linear estimators of $f(x)$
attain the optimal rate of convergence but not the best constant in the AM sense.
This is established by constructing nonlinear improvements of minimax linear
estimators. Hence, the linear method which led to the result of Theorem 1.3.8
is not available in these cases, and the calculation of exact AM constants appears
to be substantially more difficult for the estimation of $f$ at a point. This also
seems to be the case for losses in $L_p$, $p \ne 2$.


1.3.5 Some further topics 


(a) The multivariate case 


Consider a regression function of several variables: $f: \mathcal{X} \to \mathbb{R}$, $\mathcal{X} \subset \mathbb{R}^d$.
Assume that $\mathcal{X}$ is open and bounded. Convergence rates for estimating $f$
at a point were established by Stone (1980). If $\mathcal{F}$ is a smoothness class described
in terms of partial derivatives up to order $\beta$, the optimal rate (for squared
error) is $n^{-2\beta/(2\beta+d)}$. For estimating with global loss in $L_p(\mathcal{X})$, $1 \le p \le \infty$,
lower risk bounds analogous to those in Theorem 1.3.1 have been found by
Stone (1982) and Nussbaum (1982). To show the attainment of these bounds,
for domains $\mathcal{X}$ of sufficiently arbitrary shape, the edge problem associated
with a curvilinear boundary has to be solved. Using piecewise polynomial
estimation, Stone (1982) proved attainment when the loss is defined in $L_p(\mathcal{X}^*)$,
where $\mathcal{X}^*$ is some subset of $\mathcal{X}$ with positive distance from the boundary of $\mathcal{X}$.
Cox (1984) established attainment for multivariate smoothing splines, for
squared $L_2(\mathcal{X})$-loss. The conditions on $\mathcal{X}$ are that $\mathcal{X}$ is compact, simply
connected and has a boundary of class $C^\infty$. It is possible to prove attainment of optimal
rates for piecewise polynomial estimation, viz. for loss in $L_p(\mathcal{X})$, $1 \le p \le \infty$,
and compact $\mathcal{X}$ with Lipschitz boundary; see Nussbaum (1986). Even smooth
linear spline estimators can be employed for this purpose, i.e., linear
combinations of multivariate B-splines.


(b) Nonlinear estimators and general aspects


Up to now, only estimators which are linear in the data have been discussed.
The direct application of the method of maximum likelihood, for the parameter
spaces $\mathcal{F}$ considered, can provide well-defined estimators in the present model,
as opposed to density estimation. This leads to nonlinear estimators in general;
see Nemirovski, Polyak, and Tsybakov (1984). They attain optimal rates in
some cases where linear estimators fail. Discretized ML estimators have been
employed in connection with the ‘method of sieves’ (see Geman and Hwang, 1982),
which has been proposed as a unifying concept for estimators in NP problems.
A unifying concept for linear estimators is the delta method (Susarla and
Walter, 1981). Optimal rates of convergence in an abstract general setting of
NP estimation have been studied by Birgé (1983). A recent monograph on
general curve estimation is Prakasa Rao (1984).

For a survey of NP regression models with random $(x_i, y_i)$ see Collomb (1981).
Limit theorems for estimators and various probabilistic properties have
received much attention; see e.g. Liero (1982).


(c) Robustness 


The normal distributional assumption on the i.i.d. disturbance variables $\varepsilon_i$
is not necessary for the results on optimal rates; some regularity conditions
suffice. Some papers address robustness against variation in the error
distribution. It is possible to ‘robustify’ the linear smoothers such as kernel and
spline methods; the result is a nonlinear smoother whose robustness is, for
example, reflected in an extended optimal rate property. The smoothing spline
case has been dealt with by Cox (1983); for the kernel method we mention
Tsybakov (1982) and Härdle (1984).


(d) Adaptive optimal smoothing 


The optimality results for the smoothing methods $S_{n,r}$ discussed so far were
proven for certain choices of the smoothing parameter $r$; these choices depend
on the prior information on the function class $\mathcal{F}$. In practice, information on $\mathcal{F}$,
for instance on the bounds for the derivative, may be rather vague, so that
it is of major interest to derive a good choice of $r$ from the data. Most of the
methods proposed are based on cross-validation or variants of it; see e.g.
Craven and Wahba (1979), Wahba (1981), Utreras (1980). Consistency was
proved by Li (1984); a first result on efficiency is due to Hall (1983, 1984).
This efficiency result, though persuasive, does not establish actual asymptotic
risk optimality of the adaptive (cross-validated) estimator, however. A
method not related to cross-validation which achieves this goal has been
proposed by Pinsker and Yefroimovich (1984). Within the ellipsoid framework
described in Section 1.3.4, this estimator yields the optimal rate and constant for
the squared $L_2$-risk, although it does not depend on the concrete parameters
of the ellipsoid. Initial results relate to the continuous time model (25) but the
method is applicable to NP regression too. For further advances on the
efficiency of cross-validation see Rice (1984), Speckman (1985), Härdle and Marron
(1985).


1.4 References


1.4.1 References for Section 1.1 


Agha, M. (1971). ‘A direct method for fitting linear combinations of exponentials.’ 
Biometrics, 27, 399—413. 

Akahira, M. and Takeuchi, M. (1976). ‘On the second order asymptotic efficiencies of 
estimators.’ Proc. 3rd Japan-USSR Symp. on Prob. Theory, Lecture Notes in Math.
550, Springer Verlag, 1976, 604—638. 

Amari, Shun-Ichi (1982). ‘Differential geometry of curved exponential families:
curvature and information loss.’ Ann. Statist., 10, 357–385.

Anderson, T. W. (1958). An Introduction to Multivariate Analysis. John Wiley, New 
York. 

Anderson, T. W., and Taylor, J. B. (1976). ‘Strong consistency of least squares esti- 
mators in normal linear regressions.’ Ann. Statist., 4, 788 —790. 

Bahadur, R. R. (1964). ‘On Fisher’s bound for asymptotic variance.’ Ann. Math. Statist.,
35, 1545–1552.

Bahadur, R. R. (1967). ‘Rates of convergence of estimates and test statistics.’ Ann. 
Math. Statist., 39, 303—324. 

Bard, J. (1974). Nonlinear Parameter Estimation. Academic Press, New York. 

Barham, R. H., and Drane, W. (1972). ‘An algorithm for least squares estimation of 
nonlinear parameters when some of the parameters are linear.’ Technometrics, 14,
757–766.

Barnett, W. A. (1976). ‘Maximum likelihood and iterated Aitken estimation of non- 
linear systems of equations.’ J. Amer. Statist. Assoc., 71, 354—360. 

Bates, D. M., and Watts, D. G. (1980). ‘Relative measures of nonlinearity (with
discussion).’ J. Royal Statist. Soc., Ser. B, 42, 1–25.

Beale, E. M. L. (1960). ‘Confidence regions in nonlinear estimation.’ J. Royal Statist.
Soc., Ser. B, 22, 41–71.

Bird, H. A., and Milliken, G. A. (1976). ‘Estimable functions in the nonlinear models.’ 
Commun. Statist., A5, 999—1012. 

Box, G. E. P., and Coutie, G. A. (1956). ‘Application of digital computers in the
exploration of functional relationships.’ Proc. I. E. E., 103, Part B, Suppl. Nr. 1, 100–107.

Box, M. J. (1971). ‘Bias in nonlinear estimation.’ J. Royal Statist. Soc., Ser. B, 33,
171–201.

Bunke, H. (1976). ‘Simple consistent estimation in nonlinear regression by data
transformations and design of experiments.’ Math. Operationsforsch. Statist., 7, 715–719.

Bunke, H. (1977). ‘Linear parameter estimation in nonlinear regression models by pre- 
vious data transformations.’ Biometr. J., 19, 253—256. 

Bunke, H. (1981). ‘A note on parameter estimation in inadequate nonlinear regression
models.’ Math. Operationsforsch. Statist., Ser. Statist., 12, 7–11.

Bunke, H., and Bunke, O. (1974). ‘Identifiability and estimability.’ Math. Operations- 
forsch. Statist., 5, 223—233. 

Bunke, H., and Bunke, O. (Eds.) (1986). Statistical Inference in Linear Models. John 
Wiley, Chichester. 


126 Chapter 1. Parameter estimation and testing hypotheses in nonlinear models 


Bunke, H., Henschke, K., Striiby, R., and Wisotzki, C. (1977). ‘Parameter estimation 
in nonlinear regression models.’ Math. Operationsforsch. Statist., Ser. Statist., 8, 
23—40. 

Bunke, H., and Schmidt, W. H. (1980). ‘Asymptotic results on nonlinear approximation 
of regression functions and weighted least squares.’ Math. Operationsforsch. Statist., 
Ser. Statist., 11, 3—22. 

Bunke, O., and Grabowski, B. (1978). ‘A procedure for model choice or variable selection 
with controlled model specification error.’ Math. Operationsforsch. Statist., Ser. 
Statist., 9, 483–497.

Chanda, K. C. (1976). ‘Efficiency and robustness of least squares estimators.’ Sankhya, 
Ser. B, 38, 153—163. 

Chibisov, D. M. (1972). ‘An asymptotic expansion for the distribution of a statistics 
that permits asymptotic expansion.’ Theory of Probability and its Applications, 17, 
658—668, (in Russian). 

Chibisov, D. M. (1973a). ‘An asymptotic expansion for a certain class of estimators 
that includes maximum likelihood estimators.’ Theory of Probability and its Appli- 
cations, 18, 302—310, (in Russian). 

Chibisov, D. M. (1973b). ‘An asymptotic expansion for the distribution of sums of a 
special form, with an application to minimum contrast estimates.’ Theory of Pro- 
bability and its Applications, 18, 689—702, (in Russian). 

Cox, D. R. (1977). ‘Nonlinear models, residuals and transformations.’ Math. Operations- 
forsch. Statist., Ser. Statist., 8, 3—22. 

Draper, N.R., and Smith, H. (1966). Applied Regression Analysis. John Wiley, New 
York. 

Drygas, H. (1971). ‘Consistency of the least-squares and Gauss-Markov estimators in 
regression models.’ Z. Wahrscheinlichkeitstheorie verw. Gebiete, 17, 309—326. 

Drygas, H. (1976). ‘Weak and strong consistency of the least-squares estimators in 
regression models.’ Z. Wahrscheinlichkeitstheorie verw. Gebiete, 34, 119—127. 

Eicker, F. (1963a). ‘Central limit theorems for families of sequences of random variables.’
Ann. Math. Statist., 34, 439–446.

Eicker, F. (1963b). ‘Asymptotic normality and consistency of the least-squares
estimators for families of linear regressions.’ Ann. Math. Statist., 34, 447–456.

Eicker, F. (1965). ‘Limit theorems for regressions with unequal and dependent errors.’
Proc. of the 5th Berkeley Symp. Math. Statist. Prob., 1, 59–82. Univ. California Press,
Berkeley.

Eicker, F. (1966). ‘A multivariate central limit theorem for random linear vector forms.’
Ann. Math. Statist., 37, 1825–1828.

Fedorov, V. V. (1977). ‘Estimation of regression parameters in the case of vector valued 
observations.’ In: Regression Experiments (Ed. V. V. Nalimov). Moscow, Izd. Mosk.
Univ. (in Russian). 

Gallant, A. R. (1975). ‘Testing a subset of the parameters of a nonlinear regression 
model.’ J. Amer. Statist. Assoc., 70, 927—932. 

Gleser, L. J. (1965). ‘On the asymptotic theory of fixed-size sequential confidence 
bounds for linear regression parameters.’ Ann. Math. Statist., 36, 463—467. 

Gleser, L. J. (1966). ‘Correction to: On the asymptotic theory of fixed-size sequential 
confidence bounds for linear regression parameters.’ Ann. Math. Statist., 37, 1053 
to 1055. 

Goldberger, A. S. (1968). ‘The interpretation and estimation of Cobb-Douglas functions.’ 
Econometrica, 36, 464–472.

Goldfeld, S. M., and Quandt, R. E. (1972). Nonlinear Methods in Econometrics.
North-Holland Publishing Company, Amsterdam–London.

Grossmann, W. (1976). ‘Robust nonlinear regression.’ In: Compstat 1976 (Ed. G.
Bruckmann), Physica-Verlag, Würzburg, 146–152.




Hamilton, D. C., Watts, D. G., and Bates, D. M. (1982). ‘Accounting for intrinsic
nonlinearity in nonlinear regression parameter inference regions.’ Ann. Statist., 10.

Hannan, E. J. (1971). ‘Nonlinear time series regression.’ J. Appl. Prob., 8, 767–780.

Hartley, H.O. (1971). ‘The modified Gauss-Newton method for fitting of nonlinear 
regression functions by least squares.’ Technometrics, 3, 269–280.

Hoffmann, K. (1977). ‘Robust alternatives of the least squares estimator.’ Math. Ope- 
rationsforsch. Statist., Ser. Statist., 8, 305—311. 

Huber, P. J. (1964). ‘Robust regression of a location parameter.’ Ann. Math. Statist.,
35, 73–101.

Jennrich, R. I. (1969). ‘Asymptotic properties of nonlinear least squares estimators.’
Ann. Math. Statist., 40, 633—643. 

Kruskal, W. (1968). ‘When are Gauss-Markov and least squares estimators identical? 
A coordinate-free approach.’ Ann. Math. Statist., 39, 70—75. 

Läuter, H. (1989). ‘Note on the strong consistency of the least squares estimator in
nonlinear regression.’ Statistics, 20, 2. 

Lawton, H., and Sylvestre, E. A. (1971). ‘Elimination of linear parameters in nonlinear
regression.’ Technometrics, 13, 461–467.

McGilchrist, C. A. (1968). ‘Efficient difference equation estimators in exponential
regression.’ Ann. Math. Statist., 39, 1938—1945. 

Malinvaud, E. (1970). Statistical Methods of Econometrics (2nd rev. ed.). North-Holland 
Publishing Company, Amsterdam — London. 

Marquardt, D. W. (1963). ‘An algorithm for least squares estimation of nonlinear
parameters.’ SIAM J. Appl. Math., 11, 431–441.

Michel, R. (1975). ‘An asymptotic expansion for the distribution of asymptotic maxi- 
mum likelihood estimators of vector parameters.’ J. Multivariate Anal. 5, 67—82. 

Natanson, I. P. (1955). Konstruktive Funktionentheorie. Akademie-Verlag, Berlin. 

Nelder, J. A. (1961). ‘The fitting of a generalization of the logistic curve.’ Biometrics, 
18, 89—110. 

Nelder, J. A. (1962). ‘An alternative form of a generalized logistic function.’ Biometrics, 
18, 614—616. 

Nussbaum, M. (1977). ‘Asymptotic efficiency of estimators in the multivariate linear 
model.’ Math. Operationsforsch. Statist., Ser. Statist., 8, 173—198. 

Petersen, I. (1969). ‘Comparison of the method of reproducing kernels with the method 
of least squares.’ Izv. AN Eston. SSR, Fiz. Mat., 18, 403 (in Russian). 

Pfanzagl, J. (1973). ‘Asymptotic expansion related to minimum contrast estimators.’ 
Ann. Statist., 1, 993—1026. 

Pfanzagl, J. (1973b). ‘Asymptotically optimum estimation and test procedures.’ 
Proc. Prague Conf. on Asymptotic Methods of Statistics, 1, 201—272. 

Rasch, D. (1967). Schätzprobleme bei eigentlich nichtlinearen Regressionsfunktionen. Abh.
Dt. Akad. Wiss., 121—128. 

Ratkowski, D. A. (1983). Nonlinear Regression Modeling: A Unified Practical Approach. 
(Statistics: Textbooks and Monographs Series, Vol. 48), Marcel Dekker, Inc., New 
York. 

Saleh, A. E., and Choudry, G. H. (1975). ‘On fitting exponential regressions.’ Statistische 
Hefte, 16, 213—222. 

Schmidt, F. (1983). Kleinste Quadrat Schätzung in nichtlinearen Regressionsmodellen.
Vandenhoeck & Ruprecht, Göttingen.

Schmidt, W. H. (1975a). ‘Asymptotic normality of least-squares estimators in multi- 
variate singular linear models.’ Math. Operationsforsch. Statist., 6, 285—300. 

Schmidt, W. H. (1975b). ‘Asymptotic optimality of estimators in multivariate linear 
models.’ Math. Operationsforsch. Statist., 6, 713—731. 




Schmidt, W. H. (1976). ‘Strong consistency of variance estimation and asymptotic 
theory for tests of the linear hypothesis in multivariate linear models.’ Math. Opera- 
tionsforsch. Statist., 7, 701—705. 

Schmidt, W. H. (1977). ‘Asymptotics in multivariate linear models with optimal experi- 
mental designs.’ Math. Operationsforsch. Statist., 8, 447—452. 

Schmidt, W. H. (1979). ‘Asymptotic results for estimation and testing variances in 
regression models.’ Math. Operationsforsch. Statist., Ser. Statist., 10, 209 —236. 

Schmidt, W. H., and Zwanzig, S. (1986). ‘Second order asymptotics in nonlinear re- 
gression.’ J. Multivariate Anal., 18, 187—215. 

Schönfeld, P. (1969). Methoden der Ökonometrie, Bd. 1: Lineare Regressionsmodelle.
Verlag Franz Vahlen GmbH, Berlin und Frankfurt (Main). 

Stoer, J. (1972). Numerische Mathematik I. Springer-Verlag, Berlin. 

Wisotzki, C. (1977). ‘Polynomial approximation of nonlinear regression functions.’ 
Math. Operationsforsch. Statist., Ser. Statist., 8, 313—321. 

Wu, Chien-Fu (1981). ‘Asymptotic theory of nonlinear least squares estimation.’ Ann. 
Statist., 9, 501—513. 

Zwanzig, S. (1980). ‘Inadequate least squares.’ Math. Operationsforsch. Statist., Ser. 
Statist., 11, 23—48. 


1.4.2 References for Section 1.2


Agarwal, G.G., and Studden, W. J. (1978). ‘Asymptotic design and estimation using 
linear splines.’ Commun. Statist. — Simula. Computa., B7, 309—319. 

Ahlberg, J. H., Nilson, H. N., and Walsh, J. L. (1967). The Theory of Splines and Their 
Applications. Academic Press, New York. 

Bacon, D. W., and Watts, D. G. (1971). ‘Estimating the transition between two
intersecting straight lines.’ Biometrika, 58, 525–534.

Barnard, G. A. (1959). ‘Control charts and stochastic processes.’ J. Royal Statist. Soc., 
Ser. B, 21, 239—271. 

Borodjuk, W. P., and Lezki, E. K. (1977). Grundlagen der Verfahrenstechnik und
chemischen Technologie. Statistische Modellierung verfahrenstechnischer Systeme.
Akademie-Verlag, Berlin.

Brown, R. L., Durbin, J., and Evans, J. M. (1975). ‘Techniques for testing the
consistency of regression relationships over time (with discussion).’ J. Royal Statist. Soc.,
Ser. B, 37, 149–192.

Bunke, H. (1973). ‘Approximation of regression functions.’ Math. Operationsforsch.
Statist., 4, 314–325.

Bunke, H., and Bunke, O. (1974). ‘Das empirische Entscheidungsprinzip und die Wahl
von Regressionsmodellen.’ Biometr. Zeitschr., 16, 167–184.

Bunke, H., and Bunke, O. (Eds.) (1986). Statistical Inference in Linear Models. John
Wiley, Chichester.

Bunke, H., and Schulze, U. (1984). ‘Approximation of change points in regression models.’
Proc. of the 1st Intern. Tampere Seminar on Linear Statist. Models and their
Applications, Univ. of Tampere.

Buse, A., and Lim, L. (1977). ‘Cubic splines as a special case of restricted least squares.’
J. Amer. Statist. Assoc., 72, 64–68.

Chow, G. C. (1960). ‘Tests of equality between sets of coefficients in two linear
regressions.’ Econometrica, 28, 591–605.

Dathe, H. M., and Miller, P. H. (1980). ‘A contribution to spline regression.’ Biometr. J.,
22, 259–269.

Dumncz, B. L. (1969). ‘Discontinuities in the surface structure of alcohol-water
mixtures.’ Kolloid-Zeitschr. u. Zeitschrift f. Polymere, 230, 346–357.


Eder, F. X. (1968). Moderne Meßmethoden der Physik, Teil 1. VEB Deutscher Verlag
der Wissenschaften, Berlin. 

Ertel, J. E., and Fowlkes, E. B. (1976). ‘Some algorithms for linear spline and piecewise
multiple linear regression.’ J. Amer. Statist. Assoc., 71, 640—648. 

Fair, R. C., and Jaffee, D. M. (1972). ‘Methods of estimation for markets in
disequilibrium.’ Econometrica, 40, 497–514.

Farley, J. U., and Hinich, M. J. (1970). ‘A test for a shifting slope coefficient in a linear 
model.’ J. Amer. Statist. Assoc., 65, 1320—1329. 

Feder, P.I. (1975a). ‘On asymptotic distribution theory in segmented regression problems 
— identified case.’ Ann. Statist., 3, 49—83. 

Feder, P. I. (1975b). ‘The log likelihood ratio in segmented regression.’ Ann. Statist., 3,
84–97.

Ferreira, P. E. (1975). ‘A Bayesian analysis of a switching regression model: known
number of regimes.’ J. Amer. Statist. Assoc., 70, 370—374. 

Gallant, A. R., and Fuller, W. A. (1973). ‘Fitting segmented polynomial regression 
models whose join points have to be estimated.’ J. Amer. Statist. Assoc., 68, 144—147. 

Garbade, K. (1977). ‘Two methods for examining the stability of regression coefficients.’
J. Amer. Statist. Assoc., 72, 54—63. 

Goldfeld, S. M., and Quandt, R. E. (1972). Nonlinear Methods in Econometrics.
North-Holland Publ. Comp., Amsterdam.

Goldfeld, S. M., and Quandt, R. E. (1973). ‘A Markov model for switching regressions.’
J. Econometrics, 1, 3—16. 

Guthery, S. B. (1974). ‘Partition regression.’ J. Amer. Statist. Assoc., 69, 945—947. 

Hackl, P. (1980). Testing the Constancy of Regression Models over Time. Angew.
Statistik u. Ökonometrie, Heft 16. Vandenhoeck & Ruprecht, Göttingen.

Halpern, E. F. (1973). ‘Bayesian spline regression when the number of knots is unknown.’
J. Royal Statist. Soc., Ser. B, 35, 347—360. 

Hinkley, D. V. (1969). ‘Inference about the intersection in two-phase regression.’ Bio- 
metrika, 56, 495—504. 

Hinkley, D. V. (1971). ‘Inference in two-phase regression.’ J. Amer. Statist. Assoc., 66, 
736—743. 

Holbert, D., and Broemeling, L. (1977). ‘Bayesian inferences related to shifting sequences 
and two-phase regression.’ Comm. Statist.-Theor. Meth., A6, 265—275. 

Hudson, D. J. (1966). ‘Fitting segmented curves whose join points have to be estima- 
ted.’ J. Amer. Statist. Assoc., 61, 1097—1124. 

Jennrich, R. I. (1969). ‘Asymptotic properties of nonlinear least squares estimators.’ 
Ann. Math. Statist. 40, 633 —643. 

Jupp, D. L. B. (1978). ‘Approximation to data by splines with free knots.’ SIAM J. 
Num. Anal., 15, 328—343. 

McGee, V. E., and Carleton, T. W. (1970). ‘Piecewise regression.’ J. Amer. Statist. Assoc.,
65, 1109—1124. 

MacNeill, I. B. (1978). ‘Properties of sequences of partial sums of polynomial regression 
residuals with applications to test for change of regression at unknown times.’ Ann. 
Statist., 6, 422 —433. 

Park, S. H. (1978). ‘Experimental designs for fitting segmented polynomial regression 
models.’ Technometrics, 20, 151–154.

Paul, R. (1974). Halbleiterphysik. VEB Verlag Technik, Berlin. 

Physik in Übersichten. (1973). Volk und Wissen, Berlin.

Poirier, D. J. (1973). ‘Piecewise regression using cubic splines.’ J. Amer. Statist. Assoc., 
68, 515—524. 

Quandt, R. E. (1958). ‘The estimation of the parameters of a linear regression system 
obeying two separate regimes.’ J. Amer. Statist. Assoc., 53, 873 —880. 




Quandt, R. E. (1960). ‘Tests of the hypothesis that a linear regression system obeys two 
separate regimes.’ J. Amer. Statist. Assoc., 55, 324—330. 

Quandt, R. E. (1972). ‘A new approach to estimating switching regression.’ J. Amer.
Statist. Assoc., 67, 306—310. 

Quandt, R. E., and Ramsey, J. B. (1978). ‘Estimating mixtures of normal distributions 
and switching regression. (With discussion).’ J. Amer. Statist. Assoc., 73, T30—752. 

Ramsey, J. B. (1969). ‘Tests for specification errors in classical linear least-squares
regression analysis.’ J. Royal Statist. Soc., 31, 350–371.

Robison, D. E. (1964). ‘Estimates for the points of intersection of two polynomial
regressions.’ J. Amer. Statist. Assoc., 59, 214–224.

Roy, S. N. (1953). ‘On a heuristic method of test construction and its use in multivariate 
analysis.’ Ann. Math. Statist., 24, 220—238. 

Schmidt, P., and Sickles, R. (1977). ‘Further evidence on the use of the Chow test under
heteroscedasticity.’ Econometrica, 45, 1293–1298.

Schulze, U. (1973). ‘Regressionsmodelle mit verschiedenen Zuständen.’ Diplomarbeit,
Humboldt-Universität, Berlin.

Schulze, U. (1977a). ‘Estimation of the unknown change-point between regression
regimes.’ ikm VII. Intern. Kongr. über Anwendungen d. Math. in den Ingenieur-
wissensch. mit d. Rahmenthema: Anwendungen d. elektronischen Datenverarbeitung
im Bauwesen, Weimar 1975.

Schulze, U. (1977b). ‘Identifikation von Zustandsänderungen.’ Poster session, 3. Intern.
Sommerschule ‘Modellwahl’, Mühlhausen.

Schulze, U. (1982). ‘Modelle mit Zustandsänderungen.’ Dissertation. Akademie der
Wissenschaften der DDR, Berlin.

Sprent, P. (1961). ‘Some hypotheses concerning two phase regression lines.’ Biometrics,
17, 634–645.

Thalheim, W. (1977). ‘Prüfung linearer Modelle.’ Diplomarbeit, Humboldt-Universität,
Berlin.

Toyoda, T. (1974). ‘Use of the Chow test under heteroscedasticity.’ Econometrica, 42,
601–608.

Wold, S. (1974). ‘Spline functions in data analysis.’ Technometrics, 16, 1–11.

Jayatissa, W. A. (1977). ‘Tests of equality between sets of coefficients in two linear
regressions when disturbance variances are unequal.’ Econometrica, 45, 1291–1292.


1.4.3 References for Section 1.3 


Agarwal, G. G., and Studden, W. J. (1980). ‘Asymptotic integrated mean square error 
using least squares and bias minimizing splines.’ Ann. Statist., 8, 1307—1325. 

Balakrishnan, A. V. (1976). Applied Functional Analysis. Springer-Verlag, New York. 

Beran, R. (1982). “Robust estimation in models for independent non-identically distri- 
buted data.’ Ann. Statist., 10, 415—428. 

Birgé, L. (1983). ‘Approximation dans les espaces métriques et théorie de l’estimation.’
Z. Wahrsch. verw. Gebiete, 65, 181–237.

Bunke, O. (1985). ‘A nonparametric small sample theory of estimation of regression 
functions.’ Proc. Fourth Pannonian Symp. on Math. Statistics, Bad Tatzmannsdorf 1983, 
North-Holland Publ. Co, Amsterdam. 

Cheng, K. F., and Lin, P. E. (1981). ‘Nonparametric estimation of a regression function.’
Z. Wahrsch. verw. Gebiete, 57, 223–233.

Collomb, G. (1981). ‘Estimation non-paramétrique de la régression: revue
bibliographique.’ Internat. Statist. Review, 49 (1), 75–93.

Cox, D. D. (1983). ‘Asymptotics for M-type smoothing splines.’ Ann. Statist., 11,
530–551.




Cox, D. D. (1984). ‘Multivariate smoothing spline functions.’ SIAM J. Numer. Anal., 
21, 789—813. 

Craven, P., and Wahba, G. (1979). ‘Smoothing noisy data with spline functions.’ Nu- 
merische Math., 31, 377—403. 

Davis, K. B. (1977). ‘Mean integrated square error properties of density estimates.’
Ann. Statist., 5, 530–535.

Fedotov, A. M. (1981). ‘An information inequality for operator equations in Hilbert 
space.’ Theory Probab. Appl., 26, 377 —384 (in Russian). 

Fedotov, A. M. (1982). Linear Ill-posed Problems with Random Errors in the Data. 
Nauka, Novosibirsk (in Russian). 

Gasser, Th., and Müller, H.-G. (1979). ‘Kernel estimation of regression functions.’ In:
Smoothing Techniques for Curve Estimation (Th. Gasser, M. Rosenblatt, Eds.). Lecture
Notes in Math. 757, 23–68, Springer-Verlag, New York.

Geman, S., and Hwang, C. R. (1982). ‘Nonparametric maximum likelihood estimation
by the method of sieves.’ Ann. Statist., 10, 401–414.

Golubev, G. K. (1982). ‘On minimax filtering of functions in $L_2$.’ Problems Inform.
Transmission, 18 (4), 67–75 (in Russian).

Golubev, G. K. (1984). ‘On minimax estimation of regression.’ Problems Inform.
Transmission, 20 (1), 56–64 (in Russian).

Golubev, G. K. (1987). ‘Adaptive asymptotically minimax estimates of smooth signals.’
Problems Inform. Transmission, 23 (1), 57–67 (in Russian).

Hajek, J. (1973). ‘Local asymptotic minimax and admissibility in estimation.’ Proc. of 
the Sixth Berkeley Symposium on Math. Stat. and Probability. University of California 
Press, Vol. 1, 175—194. 

Hall, P. (1983). ‘Large sample optimality of least squares cross-validation in density 
estimation.’ Ann. Statist., 11, 1156—1174. 

Hall, P. (1984). ‘Asymptotic properties of integrated square error and cross-validation 
for the kernel estimation of a regression function.’ Z. Wahrsch. verw. Gebiete, 67, 
175—196. 

Härdle, W. (1984). ‘Robust regression function estimation.’ J. Multivar. Anal., 14,
169–180.

Härdle, W., and Marron, J. S. (1985). ‘Optimal bandwidth selection in nonparametric
regression function estimation.’ Ann. Statist., 13, 1465–1481.

Ibragimov, I. A., and Khasminski, R. Z. (1980a). ‘Asymptotic properties of some
nonparametric estimates in Gaussian white noise.’ In: Proceedings of the Summer School
in Math. Statistics (Varna, 1978), BAN, Sofia (in Russian).

Ibragimov, I. A., and Khasminski, R. Z. (1980b). ‘Asymptotic efficiency bounds for
nonparametric estimation of a regression function in $L_2$.’ Zapiski naučnych seminarov
LOMI, 97, 88–101 (in Russian).

Ibragimov, I. A., and Khasminski, R. Z. (1981). Statistical Estimation: Asymptotic 
Theory. Springer-Verlag, New York. 

Ibragimov, I. A., and Khasminski, R. Z. (1982). ‘Bounds for the risk of nonparametric
estimates of regression.’ Theory Probab. Appl., 27, 81–94 (in Russian).

Ibragimov, I. A., and Khasminski, R. Z. (1982b). ‘On density estimation within a class of
entire functions.’ Theory Probab. Appl., 27, 514–524 (in Russian).

Koryakin, A. I. (1983). ‘Estimation of a function from randomized observations.’
Zhurn. vychisl. mat. i matemat. fiziki, 23 (1), 21–28 (in Russian).

Li, Ker-Chau (1982). ‘Minimaxity of the method of regularization on stochastic pro- 
cesses.’ Ann. Statist., 11, 141—156. 

Li, Ker-Chau (1984). ‘Consistency of cross-validated nearest neighbor estimates in 
nonparametric regression.’ Ann. Statist., 12, 230—240. 

Liero, H. (1982). ‘On the maximal deviation of the kernel regression function estimate.’
Math. Operat. Statist., Ser. Statistics, 13 (2), 171–182.




Makowski, G. (1974). ‘A rate of convergence of a distribution connected with integral 
regression function estimation.’ Ann. Statist., 2, 829—832. 

Millar, P. W. (1979). ‘Asymptotic minimax theorems for the sample distribution 
function.’ Z. Wahrsch. verw. Gebiete, 48, 233 —252. 

Millar, P. W. (1982). ‘Optimal estimation of a general regression function.’ Ann. Statist.,
10, 717–740.

Müller, H. G. (1984). ‘Smooth optimum kernel estimators of densities, regression curves
and modes.’ Ann. Statist., 12, 766–774.

Nemirovski, A. S., Polyak, B. T., and Tsybakov, A. B. (1984). ‘Signal processing by the 
nonparametric maximum likelihood method.’ Problems Inform. Transmission, 20 (3), 
29—46 (in Russian). 

Nussbaum, M. (1982). ‘Optimal $L_2$-convergence rates for estimates of a multiple
regression function.’ Preprint P-Math-07/82. Academy of Sciences GDR.

Nussbaum, M. (1985). ‘Spline smoothing in regression models and asymptotic
efficiency in $L_2$.’ Ann. Statist., 13, 984–997.

Nussbaum, M. (1986). ‘Nonparametric estimation of a regression function which is
smooth on a domain of $R^k$.’ Theory Probab. Appl., 31, 118–125 (in Russian).

Pinsker, M. S. (1980). ‘Optimal filtering of square integrable signals in Gaussian white
noise.’ Problems Inform. Transmission, 16 (2), 52–68 (in Russian).

Pinsker, M. S., and Yefroimovich, S. Yu. (1981). ‘Estimation of a square-integrable
spectral density from a sequence of observations.’ Problems Inform. Transmission, 17
(3), 50–68 (in Russian).

Pinsker, M. S., and Yefroimovich, S. Yu. (1982). ‘Estimation of a square-integrable
probability density of a random variable.’ Problems Inform. Transmission, 18 (3),
19–38 (in Russian).

Pinsker, M.S., and Yefroimovich, S. Yu. (1984). ‘A learning algorithm for nonpara- 
metric filtering.’ Avtomatika i Telemekhanika, 11, 58—65 (in Russian). 

Prakasa Rao, B. L. S. (1984). Functional Estimation. John Wiley, New York. 

Priestley, M.B., and Chao, M.T. (1972). ‘Nonparametric function fitting.’ J. Royal 
Statist. Soc., Ser. B, 34, 385—392. 

Ragozin, D. (1983). “Error bounds for derivative estimates based on spline smoothing of 
exact or noisy data.’ J. Approx. Theory, 37, 335—355. 

Rice, J., and Rosenblatt, M. (1981). ‘Integrated mean square error of a smoothing spline.’ 
J. Approx. Theory, 38, 353—369. 

Rice, J., and Rosenblatt, M. (1983). ‘Smoothing splines: regression, derivatives and 
deconvolution.’ Ann. Statist., 11, 141—156. 

Rice, J. (1984). ‘Bandwidth choice for nonparametric regression.’ Ann. Statist. 12, 
1215—1230. 

Sacks, J.,and Strawderman, W. (1982). ‘Improvements on linear minimax estimates.’ In: 
Statistical Decision Theory and Related Topics, III, 2 (S. Gupta, ed.). Academic Press, 
New York. 

Speckman, P. (1985). ‘Spline smoothing and optimal rates of convergence in
nonparametric regression models.’ Ann. Statist., 13, 970–983.

Stone, C. (1980). ‘Optimal rates of convergence for nonparametric estimators.’ Ann.
Statist., 8, 1348–1360.

Stone, C. (1982). ‘Optimal global rates of convergence for nonparametric regression.’
Ann. Statist., 10, 1040–1053.

Susarla, V., and Walter, G. (1981). ‘Estimation of a multivariate density function using
delta sequences.’ Ann. Statist., 9, 347–355.

Tsybakov, A. B. (1982). ‘Nonparametric estimation of signals with incomplete
information on the noise distribution.’ Problems Inform. Transmission, 18 (2), 44–60 (in
Russian).




Utreras, F. (1980). ‘Sur le choix du paramètre d’ajustement par le lissage par fonctions
spline.’ Numerische Math., 34, 15–28.

Utreras, F. (1983). ‘Natural spline functions; their associated eigenvalue problem.’ 
Numerische Math., 42, 107—117. 

Van der Linde, A. (1986). ‘Interpolation of regression functions in reproducing kernel 
Hilbert spaces.’ Statistics, 17, 351—361. 

Wahba, G. (1975). ‘Optimal convergence properties of variable knot, kernel and ortho- 
gonal series methods for density estimation.’ Ann. Statist, 3, 15—29. 

Wahba, G. (1978). ‘Improper priors, spline smoothing and the problem of guarding 
against model errors in regression.’ J. Roy. Statist. Soc., Ser. B, 40, 364—372. 

Wahba, G. (1981). ‘Data-based optimal smoothing of orthogonal series density estimates.’ 
Ann. Statist., 9, 146—156. 

Watson, G. S., and Leadbetter, M. R. (1963). ‘On the estimation of the probability den- 
sity I.’ Ann. Math. Statist., 33, 1065—1076. 


Chapter 2 


Robust statistical inference in linear models 


2.1 General remarks on robustness


The method of least squares and standard normal theory are very attractive for their
flexibility and wide applicability to complex linear models. They work well
provided that the normal distribution is reasonably close to the real problem
at hand and outliers are of little concern.

We know in practice that most models will seldom fit the real situations 
exactly. Thus we must pay attention to the question of what happens to 
specific techniques and procedures if the hypotheses on which they are devel- 
oped do not hold. Stigler’s (1973) historical studies show that already Laplace, 
Edgeworth, Newcomb, and Daniell (among many other earlier investigators) 
cared very much about the influence of the basic assumptions. While they 
recognized that mistakes would be made by using incorrect models, they prob- 
ably had no idea how bad the errors really could be, lacking the computer 
backup. 

E. S. Pearson (1931) may have been the first to note the high sensitivity of
some standard procedures (namely of the test for equality of variances) to
deviations from normality. Incidentally, in connection with the same problem,
Box (1953) first used the term ‘robustness’.

In the late 1940s Tukey and the Statistical Research Group at Princeton
began to emphasize the problem, to show the shortcomings of the classical
estimators and to establish properties of several really practicable alternatives
to them, mainly for the case of estimating a single location parameter. They
rediscovered and investigated the $\alpha$-trimmed mean. Later Tukey (1962)
remarked: ‘We need to face up to more realistic problems. The fact that normal
theory, for instance, may offer the only framework in which some problem
can be tackled simply or algebraically may be a very good reason for starting
with the normal case, but never can be a good reason for stopping there.’

Moreover, it follows from results of Kagan, Linnik, and Rao (1965, 1973) 
that the least squares estimator coincides with Pitman’s estimator correspon- 
ding to the quadratic loss if and only if the basic distribution is normal; this 
means that the admissibility of the least squares estimator with respect to 
quadratic loss is a characteristic property of the normal law (see also section 
2.1.7 of Bunke and Bunke, 1986). 


The classical procedures are highly sensitive to gross errors (i.e. to
outliers and long-tailed distributions): 10% of outliers with standard
deviation 30 contribute a variance equal to that of the remaining 90% of
the cases with standard deviation 10. Outliers can thus double or triple the
variance, so that cutting out their effect could really increase the precision.
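
The arithmetic behind this statement can be checked with a small sketch. It assumes a zero-mean mixture, so that the variance is the weighted sum of the component variances; the helper name `mixture_variance` is hypothetical and not part of the original text.

```python
def mixture_variance(components):
    """Variance of a zero-mean mixture.

    components: list of (weight, standard_deviation) pairs.
    For zero-mean components the mixture variance is sum(weight * sd**2).
    """
    return sum(weight * sd ** 2 for weight, sd in components)


# Contribution of the 90% regular cases (sd 10) and of the 10% outliers (sd 30).
regular_part = mixture_variance([(0.9, 10.0)])
outlier_part = mixture_variance([(0.1, 30.0)])
total = mixture_variance([(0.9, 10.0), (0.1, 30.0)])

print(regular_part, outlier_part, total)
```

Both contributions equal 0.9 · 10² = 0.1 · 30² = 90, so in this example the outliers take the variance from 100 (the uncontaminated case) to 180.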

In the light of these facts, we must seek statistical procedures that are good 
not only for one model but also for a broad class of possible underlying models; 
they need not be necessarily best for any one of them. Box and Anderson 
(1955) introduced the notion ‘robustness’ as follows: procedures are required 
which are ‘robust’ (insensitive to changes in extraneous factors not under test) 
as well as powerful (sensitive to specific factors under test). 

When speaking about robustness, we must keep in mind two points. First, 
the set of distributions (or parameters, or vectors of observations) over which 
the procedure is to be robust. The set may consist of the normal distribution 
only, or it may be the set of all symmetric smooth distributions, a selected 
finite set of distribution shapes, a neighbourhood of one shape, etc. Second, we 
must know the property of the procedure which has to be robust. We may be con- 
cerned with the stability of confidence levels, of the power, of the variance, etc. 

One possible formal definition of robustness was given by Hampel (1971):
for a sequence $\{T_n\}$ of estimators, small deviations in the basic distribution
of the observations should cause only small deviations in the distributions of the
estimates (both measured by Prokhorov’s distance); this holds up to a
‘breakdown point’, the greatest distance from the supposed model at which
the estimator still tells us something.

This chapter provides a review of some results on robust estimation in the
linear model. The area of robust estimation and testing has been a permanent
focus of scientific interest during the last 20 years. The first version of the
chapter was prepared in 1976; since then the area has undergone considerable
development. Hence, the text is far from being a complete review of robustness,
though much of the material on this subject may be found by consulting
the references. The most complete review of robust procedures, with an
emphasis on the M-estimators, can be found in Huber’s recent monograph (Huber,
1981). Some results on robust estimators can also be found in the monographs
by Serfling (1980) and Lehmann (1983) and in the extensive paper by Bickel
(1981). Robust tests and estimators are also investigated, with an emphasis
on the sequential modifications, in the monograph by Sen (1981). On the other
hand, the relations among the different estimators considered in this chapter
are not often mentioned in the literature.

Most of the considerations will be asymptotic for the number n of observa- 
tions increasing infinitely. The exact distributions are available only in several 
special cases; all the above mentioned monographs are mainly based on the 
asymptotic theory as well. Some finite-sample results, based on Monte Carlo 
considerations, can be found, among others, in the Princeton Study (Andrews 
et al., 1972). 

2.2 Robust alternatives to the method of least squares


We shall consider the problem of estimating the regression parameters of a
linear model. We want to estimate $\beta$ after observing $y_{(n)} = (y_{1n}, \ldots, y_{nn})'$,
where

$$y_{(n)} = X_n \beta + \varepsilon, \tag{1}$$

$\beta = (\beta_1, \ldots, \beta_p)'$ is a vector of unknown regression parameters,
$\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$ is a vector of errors and
$X_n = ((x_{ij}))_{i=1,\ldots,n;\ j=1,\ldots,p}$ is a matrix of known regression
constants (design matrix) of rank $p$. Most of our considerations will be asymptotic
as the number of observations $n$ becomes large and the number of regression
parameters $p$ remains fixed. Thus, the coordinates of $y_{(n)}$ and of $X_n$ depend
on $n$; we shall not indicate this dependence explicitly unless its omission would
cause confusion. Throughout we shall suppose that $\varepsilon_i$, $i = 1, \ldots, n$, are
independent and identically distributed with the common distribution function $F$ and
density $f$ with respect to Lebesgue measure; $F$ and $f$ are generally unspecified.

If $F$ is normal with mean 0, the appropriate procedure is to minimize the
sum of squares

$$\sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2 \tag{2}$$

or, equivalently, to solve the system of equations

$$\sum_{i=1}^{n} \Bigl( y_i - \sum_{k=1}^{p} x_{ik}\beta_k \Bigr) x_{ij} = 0, \qquad j = 1, \ldots, p. \tag{3}$$

The least squares estimator

$$\hat{\beta} = W_n^{-1} X_n' y_{(n)}, \qquad \text{where } W_n = X_n' X_n, \tag{4}$$

is admissible with respect to the quadratic loss if and only if $F$ is normal (see
2.1.7 of Bunke and Bunke, 1986).

For the location submodel ($p = 1$, $x_{i1} = 1$) three different classes of
estimation procedures alternative to (4) were considered: M-estimators (estimators
of maximum likelihood type), R-estimators (estimators based on ranks of
observations) and L-estimators (linear combinations of order statistics).
These procedures lead, in a more or less straightforward way, to extensions
to a linear regression model.

We shall work with the residuals

$$\delta_i(\beta) = y_i - \sum_{j=1}^{p} x_{ij}\beta_j, \qquad i = 1, \ldots, n. \tag{5}$$

A common idea of all these procedures is to replace the function (2) to be
minimized by some other function less sensitive to the extreme values of the
residuals (5).



2.2.1 L-estimators 


In the location submodel, L-estimators are linear combinations of order
statistics. If $y^{(1)} \le \cdots \le y^{(n)}$ are the ordered observations, the estimators are
of the form

$$\hat{\beta} = \sum_{i=1}^{n} \lambda_i y^{(i)}. \tag{6}$$

If the coefficients $\lambda_i$ are generated by a suitably chosen weight function $J$
such that $\int_0^1 J(u) F^{-1}(u)\,du = 0$ (this condition guarantees the identifiability
of the parameter), so that $\lambda_i = n^{-1} J\bigl(\tfrac{i}{n+1}\bigr)$, $i = 1, \ldots, n$, and various other
regularity conditions are satisfied (see Bickel, 1967; Chernoff, Gastwirth, and
Johns, 1967; Shorack, 1969; Stigler, 1974), then $n^{1/2}(\hat{\beta} - \beta)$ is asymptotically
normal with mean 0 and variance

$$K_0(J, F) = \iint J(F(x))\, J(F(y))\, [F(\min(x, y)) - F(x) F(y)]\,dx\,dy. \tag{7}$$
If $F$ is known, then

$$J(t) = \frac{\psi'(F^{-1}(t))}{I(f)},$$

where $\psi = -f'/f$, $I(f) = \int \psi^2(x)\,dF(x)$ is the Fisher information of $f$,
and $F^{-1}(t) = \inf\{x : F(x) \ge t\}$, yields an asymptotically efficient estimator,
i.e. one which achieves the information inequality lower bound as $n \to \infty$
(Jung, 1955; cf. also Theorem 2.4.12 of Bunke and Bunke, 1986).

Of particular interest from the point of view of robustness are the $\alpha$-trimmed
means corresponding to

$$J(t) = \begin{cases} \dfrac{1}{1 - 2\alpha} & \text{if } \alpha \le t \le 1 - \alpha, \\[1ex] 0 & \text{otherwise.} \end{cases}$$


L-estimators are computationally appealing and have further attractive 
properties in the location model (cf. Bickel and Lehmann, 1975). However, 
they do not extend to the linear model in a straightforward way. A possible 
regression analogue of L-estimators was suggested and studied by Bickel (1973). 
His estimators, defined in the two-step way with the aid of an initial estimator, 
have good efficiency properties but they are computationally complex. 
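
As an illustration of an L-estimator in the location model, the following minimal sketch computes an α-trimmed mean. It uses the common finite-sample convention of discarding the ⌊nα⌋ smallest and ⌊nα⌋ largest order statistics and averaging the rest, which is an assumption made here; it agrees with the weights generated by the trimming function J only approximately for finite n. The function name `trimmed_mean` is hypothetical.

```python
import math


def trimmed_mean(y, alpha):
    """alpha-trimmed mean of the sample y (0 <= alpha < 1/2).

    Drops floor(n * alpha) order statistics at each end and averages
    the remaining ones with equal weights.
    """
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 1/2")
    ordered = sorted(y)                 # order statistics y(1) <= ... <= y(n)
    n = len(ordered)
    g = math.floor(n * alpha)           # number trimmed at each end
    kept = ordered[g:n - g]
    return sum(kept) / len(kept)


sample = [1, 2, 3, 4, 100]              # one gross error
print(trimmed_mean(sample, 0.0))        # ordinary mean, pulled towards 100
print(trimmed_mean(sample, 0.2))        # trimmed mean, ignores the extremes
```

With α = 0 the estimator reduces to the sample mean, so a single outlying observation can move it arbitrarily far; with α = 0.2 the largest and smallest observations no longer enter the average.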

Koenker and Bassett (1978) extended the concept of the sample quantile to
the regression model in the following way: for $\alpha \in (0, 1)$, the $\alpha$-regression
quantile $\hat{\beta}(\alpha)$ is defined as a solution of the minimization problem

$$\sum_{i=1}^{n} \varrho_\alpha\Bigl( y_i - \sum_{j=1}^{p} x_{ij} t_j \Bigr) := \min$$

with respect to $t = (t_1, \ldots, t_p)'$, where $\varrho_\alpha(y) = y(\alpha - I[y < 0])$, $y \in R^1$. The
solution of this minimization is generally not uniquely determined but we can
always give a rule which selects one of the set of solutions (the asymptotic
behaviour of estimates is independent of this rule). The regression quantiles
seem to provide a basis for an extension of L-estimators and related procedures
to the linear model. Koenker and Bassett (1978) also proposed the trimmed least
squares estimator, which is defined in the following way: trim off all
observations satisfying

$$y_i - \sum_{j=1}^{p} x_{ij}\hat{\beta}_j(\alpha) \le 0 \qquad \text{or} \qquad y_i - \sum_{j=1}^{p} x_{ij}\hat{\beta}_j(1 - \alpha) \ge 0$$

($0 < \alpha < 1/2$) and then calculate the ordinary least squares estimator from
the remaining observations. The same authors proposed a modified linear
programming algorithm for the computation of the regression quantiles. From
the computational point of view, the trimmed least squares estimator can
be recommended for practical applications.
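
To see how the check function reproduces the ordinary sample quantile, the sketch below treats only the location special case ($p = 1$, $x_{i1} = 1$), which is an assumption made to keep the example short; the general regression quantile requires linear programming and is not attempted here. In the location case the objective is piecewise linear and convex, so a minimizer can be searched among the observations themselves. The function names are hypothetical.

```python
def check_function(u, alpha):
    """rho_alpha(u) = u * (alpha - I[u < 0])."""
    return u * (alpha - (1.0 if u < 0 else 0.0))


def location_quantile(y, alpha):
    """A minimizer of sum_i rho_alpha(y_i - t) over t (location model only).

    The objective is piecewise linear and convex with kinks at the data
    points, so it suffices to compare its values at the observations.
    When the minimizer is not unique, the smallest one is returned.
    """
    def objective(t):
        return sum(check_function(yi - t, alpha) for yi in y)

    return min(sorted(y), key=objective)


data = [5, 1, 4, 2, 3]
print(location_quantile(data, 0.5))    # sample median
print(location_quantile(data, 0.25))   # lower quartile
```

Returning the smallest minimizer is one possible selection rule of the kind mentioned above; any other rule picking a point of the solution set would serve equally well.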

Ruppert and Carroll (1980) showed that the trimmed least squares estimator
is, under some regularity conditions, asymptotically normal with the covariance
matrix $\sigma^2(\alpha, F)\, W_n^{-1}$, with $\sigma^2(\alpha, F)$ being the asymptotic variance of the
$\alpha$-trimmed mean and $W_n = X_n' X_n$. This estimator was also studied by Jurečková
(1983a, b, 1984) under general conditions.


2.2.2 M-estimators 


We obtain M-estimators of regression parameters if we minimize

$$\sum_{i=1}^{n} \varrho\Bigl( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr) \tag{8}$$

instead of (2), where $\varrho$ is some (usually convex) function. If we differentiate (8),
we obtain (with $\psi = \varrho'$) the following system of equations:

$$\sum_{i=1}^{n} \psi\Bigl( y_i - \sum_{k=1}^{p} x_{ik}\beta_k \Bigr) x_{ij} = 0, \qquad j = 1, \ldots, p, \tag{9}$$

which is equivalent to the minimization of (8) if $\varrho$ is convex.

The class of M-estimators was established by Huber (1964, 1967) for the
location model and extended by Relles (1968) and Huber (1973) to the
regression model. A detailed investigation of M-estimators can be found in Huber
(1981).

If $f$ is smooth and if $\psi = -f'/f$, then the M-estimator coincides with the
maximum likelihood estimator. Moreover, we obtain the least squares
estimator (4) if $f$ is normal.

M-estimators are generally not scale-equivariant, i.e. they generally do not
satisfy

$$\hat{\beta}(k y_1, \ldots, k y_n) = k\, \hat{\beta}(y_1, \ldots, y_n) \qquad (k > 0).$$

To make an M-estimator scale-equivariant, we should supplement it by an
appropriate estimator of scale.

Under various regularity conditions, the above authors proved that the
M-estimator is asymptotically normally distributed (as $n \to \infty$ and $p$ is fixed)
with centre $\beta$ and covariance matrix $K(\psi, F)\, W_n^{-1}$, where

$$K(\psi, F) = \int \psi^2(x)\,dF(x) \Bigl[ \int f(x)\,d\psi(x) \Bigr]^{-2}. \tag{10}$$


A more detailed investigation of M-estimators may be found in Section 2.3. 
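
Formula (10) can be evaluated in closed form in simple cases. The sketch below does so for the clipped-linear function $\psi(x) = \max(-k, \min(k, x))$ introduced in Section 2.3 and for $F$ equal to the standard normal distribution; both choices, the closed-form integrals in the comments, and the tuning value 1.345 are assumptions made for the illustration, not statements taken from the text.

```python
import math


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def huber_asymptotic_variance(k):
    """K(psi, Phi) of (10) for psi(x) = max(-k, min(k, x)), F standard normal.

    Denominator: int f dpsi = P(|X| <= k) = 2*Phi(k) - 1, because psi has
    slope 1 on [-k, k] and slope 0 outside.
    Numerator: int psi^2 dF = int_{-k}^{k} x^2 dPhi + k^2 * P(|X| > k),
    with int_{-k}^{k} x^2 dPhi = (2*Phi(k) - 1) - 2*k*phi(k).
    """
    central_mass = 2.0 * norm_cdf(k) - 1.0
    numerator = (central_mass - 2.0 * k * norm_pdf(k)
                 + 2.0 * k * k * (1.0 - norm_cdf(k)))
    return numerator / central_mass ** 2


# Efficiency relative to least squares at the normal model is 1 / K.
for k in (1.0, 1.345, 2.0):
    print(k, 1.0 / huber_asymptotic_variance(k))
```

Since the least squares estimator has $K = 1$ at the standard normal model, $1/K(\psi, \Phi)$ is the asymptotic relative efficiency of the M-estimator there; it approaches 1 as $k$ grows.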


2.2.3 R-estimators 


Hodges and Lehmann (1963) suggested estimators of location based on the 
Wilcoxon and other rank tests; they showed that their asymptotic variances 
could be computed from the power functions of the tests, and that the estima- 
tors never have much lower but sometimes infinitely higher efficiencies than 
the sample mean. 

Adichie (1967), following ideas of Hodges and Lehmann, defined estimates
of $\beta_1$ and $\beta_2$ in the regression model $y_i = \beta_1 + \beta_2 x_i + \varepsilon_i$, $i = 1, \ldots, n$, based
on the Wilcoxon test and found their asymptotic distribution. Jurečková (1971a),
Koul (1971) and Jaeckel (1972) then extended the procedure to $p$-parameter
regression and to the general rank tests. The three corresponding estimators
are asymptotically equivalent and thus have the same asymptotic distribution
and efficiency.

Roughly speaking, we obtain an R-estimator if we minimize, instead of (2),

$$\sum_{i=1}^{n} \bigl( a_n(R_i^{\beta}) - \bar{a}_n \bigr)\, \delta_i(\beta) \tag{11}$$

with respect to $\beta = (\beta_1, \ldots, \beta_p)'$. Here $R_i^{\beta}$ is the rank of $\delta_i(\beta)$ in
$(\delta_1(\beta), \ldots, \delta_n(\beta))$, $a_n(\cdot)$ is some monotone score function, and
$\bar{a}_n = n^{-1} \sum_{i=1}^{n} a_n(i)$.

If we differentiate (11), which is a piecewise linear convex function of $\beta$,
we obtain the approximate equalities at the minimum:

$$\sum_{i=1}^{n} \bigl( a_n(R_i^{\beta}) - \bar{a}_n \bigr)\, x_{ij} \approx 0, \qquad j = 1, \ldots, p. \tag{12}$$

These approximate equations in turn can be reconverted into a minimization
problem, e.g.

$$\sum_{j=1}^{p} \Bigl| \sum_{i=1}^{n} \bigl( a_n(R_i^{\beta}) - \bar{a}_n \bigr)\, x_{ij} \Bigr| := \min! \tag{13}$$

The variant (13) was investigated by Jurečková (1971a), who proved its
asymptotic normality. This variant is a direct generalization of Hodges and
Lehmann (1963) and of Adichie (1967) to the $p$-parameter regression; the
estimators are derived by inverting rank tests for hypotheses about $\beta$. The variant
(11) was investigated by Jaeckel (1972), who also proved the asymptotic
equivalence of both procedures. The idea is that (11) could be taken as a measure
of dispersion of the residuals $\delta_i(\beta)$; in fact, if $z = (z_1, \ldots, z_n)'$ are observations
and $R_1, \ldots, R_n$ their respective ranks, then $D(z) = \sum_{i=1}^{n} (a_n(R_i) - \bar{a}_n)\, z_i$ is
translation invariant, $D(bz) = bD(z)$ for $b \ge 0$ and $D(z)$ is small if the $z_i$ are close
to each other. We thus minimize $D(\delta(\beta))$ instead of the proper variance of the
residuals, as is done by the method of least squares.
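
The properties of the dispersion measure $D(z)$ can be checked numerically. The sketch below assumes Wilcoxon-type scores $a_n(i) = i/(n+1) - 1/2$ and observations without ties; both are assumptions made for the illustration, and the function names are hypothetical.

```python
def wilcoxon_scores(n):
    """Scores a_n(i) = i/(n+1) - 1/2 for i = 1, ..., n (assumed score choice)."""
    return [i / (n + 1) - 0.5 for i in range(1, n + 1)]


def rank_dispersion(z):
    """D(z) = sum_i (a_n(R_i) - mean score) * z_i, with R_i the rank of z_i.

    Assumes there are no ties among the z_i.
    """
    n = len(z)
    scores = wilcoxon_scores(n)
    mean_score = sum(scores) / n
    # order[r] is the index of the observation with rank r + 1
    order = sorted(range(n), key=lambda i: z[i])
    return sum((scores[r] - mean_score) * z[i] for r, i in enumerate(order))


z = [3.0, -1.0, 4.0, 1.5]
print(rank_dispersion(z))
print(rank_dispersion([v + 7.0 for v in z]))   # translation invariance
print(rank_dispersion([2.5 * v for v in z]))   # equals 2.5 * D(z)
```

Because the centred scores sum to zero and are nondecreasing in the rank, shifting all observations leaves $D$ unchanged, multiplying them by $b \ge 0$ multiplies $D$ by $b$, and $D(z) \ge 0$ with small values when the $z_i$ lie close together.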

The score function $a_n(i)$ is supposed to be generated by a nonconstant,
nondecreasing square-integrable function $\varphi(t)$, $0 < t < 1$, in the following way:

$$a_n(i) = \varphi\Bigl( \frac{i}{n+1} \Bigr), \qquad i = 1, \ldots, n. \tag{14}$$

If $f$ is known and smooth, then

$$\varphi(t) = \varphi(t, f) = -\frac{f'(F^{-1}(t))}{f(F^{-1}(t))}, \qquad 0 < t < 1, \tag{15}$$

yields an asymptotically efficient estimator.

Under some regularity conditions the estimators are asymptotically normal
with mean $\beta$ and the covariance matrix $K_1(\varphi, F)\, W_n^{-1}$, where

$$K_1(\varphi, F) = \Bigl[ \int_0^1 \varphi^2(t)\,dt - \Bigl( \int_0^1 \varphi(t)\,dt \Bigr)^2 \Bigr] \Bigl[ \int_0^1 \varphi(t)\, \varphi(t, f)\,dt \Bigr]^{-2}. \tag{16}$$


Besides the solution of (11) or (13), the estimators allow two-step versions:
start with some reasonably good preliminary estimate, and then apply one
step of Newton’s method to the corresponding system of equations. Such an
estimate was investigated by Kraft and van Eeden (1972) (see also Section 2.5.2).

From the above remarks we learn that the three estimation procedures follow
the same idea: to decrease the possible influence of outlying observations.
Each of them can lead to an asymptotically efficient estimator when
the basic distribution is known. In fact, as $n \to \infty$, the estimators are
closely related to one another. For instance, suppose that the respective $J$-,
$\psi$-, and $\varphi$-functions are smooth and connected in the following way:

$$J(t) = \varphi'(t)\, f(F^{-1}(t)) \Bigl[ \int_0^1 \varphi'(u)\, f(F^{-1}(u))\,du \Bigr]^{-1}, \qquad \psi(x) = c\, \varphi(F(x)), \quad c > 0; \tag{17}$$

then the corresponding L-, M-, and R-estimators are asymptotically equivalent
in probability. The relations (17) depend explicitly on the unknown
distribution $F$, hence we are not able, for instance, to calculate the value of the
M-estimator from the known value of the R-estimator, and so on. These
relations rather show which classes of estimators correspond to each other. The
asymptotic relations of different types of estimators are studied in Section 2.6.


2.3 Properties of M-estimators 


Let us consider the model (2.2.1) under the assumption that we have some
approximate knowledge of the underlying distribution $F$; for instance, suppose
that $F$ satisfies the following model of indeterminacy established by Huber:

$$F = (1 - \epsilon)\,\Phi + \epsilon H, \tag{1}$$

where $0 < \epsilon < 1$ is a known number, $\Phi(x)$ is the standard normal cumulative
distribution function, and $H$ is an unknown symmetric contaminating distribution.
Such a ‘model of contamination’ arises, e.g., if the observations are assumed to
be normal with variance 1, but a fraction of them is affected by gross errors.
We shall also consider another model of indeterminacy, e.g.

$$\sup_{x \in R^1} |F(x) - \Phi(x)| \le \epsilon. \tag{2}$$

Huber (1964) proposed to take

$$\varrho(x) = \begin{cases} \tfrac{1}{2} x^2 & \text{if } |x| \le k, \\ k|x| - \tfrac{1}{2} k^2 & \text{if } |x| > k, \end{cases} \tag{3}$$

and

$$\psi(x) = \begin{cases} x & \text{if } |x| \le k, \\ k\, \operatorname{sign} x & \text{if } |x| > k, \end{cases} \tag{4}$$

respectively, for some $k > 0$. This choice of $\psi$ effectively limits the influence
of grossly erroneous observations: once a residual exceeds $k$ in absolute value,
it can be increased beyond any bounds without further changes in the
estimated value of $\beta$. Alternative choices of $\psi$ will be mentioned in Section 2.3.2.
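
The bounded-influence property can be illustrated in the location model by solving $\sum_i \psi(y_i - t) = 0$ numerically. The sketch below uses bisection, which works because the left-hand side is continuous and nonincreasing in $t$; the scale is taken as known and equal to 1, and the default tuning constant $k = 1.345$ and the function names are assumptions made for the example.

```python
def huber_psi(x, k):
    """psi(x) = x for |x| <= k and k * sign(x) otherwise."""
    return max(-k, min(k, x))


def huber_location(y, k=1.345, tol=1e-10):
    """Root of sum_i psi(y_i - t) = 0 by bisection (unit scale assumed).

    The score is >= 0 at min(y) and <= 0 at max(y) and is nonincreasing
    in t, so the bracket [min(y), max(y)] always contains a root.
    """
    def score(t):
        return sum(huber_psi(yi - t, k) for yi in y)

    lo, hi = min(y), max(y)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


clean = [0.1, -0.3, 0.2, 0.0]
print(huber_location(clean + [50.0]))     # one gross error
print(huber_location(clean + [5000.0]))   # a far larger gross error
print(sum(clean + [50.0]) / 5)            # sample mean for comparison
```

Both contaminated samples give essentially the same estimate, since the outlying residual contributes only the constant $k$ to the estimating equation, whereas the sample mean follows the gross error.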

The $\psi$ given by (4) leads to estimates with well-defined asymptotic and
finite-sample minimax properties in the special case of location ($p = 1$,
$x_{i1} = 1$); at least the asymptotic minimax property carries over to the
regression case.


2.3.1 Finite sample minimax properties of M-estimators 
in the location model 


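Suppose that $y_1 - \beta, \ldots, y_n - \beta$ are independent identically distributed errors
whose common distribution function $F$ satisfies the model of indeterminacy
(2). Define

$$T^{*} = \sup\Bigl\{ \tau : \sum_{i=1}^{n} \psi(y_i - \tau) > 0 \Bigr\}, \qquad T^{**} = \inf\Bigl\{ \tau : \sum_{i=1}^{n} \psi(y_i - \tau) < 0 \Bigr\}, \tag{5}$$

with $\psi$ given by (4), and put

$$T^0 = \begin{cases} T^{*} & \text{with probability } \tfrac{1}{2}, \\ T^{**} & \text{with probability } \tfrac{1}{2}, \end{cases} \tag{6}$$

where the randomization does not depend on the $y_i$.

The M-estimator $T^0$ of location has the following minimax property: if
$a > 0$ is a fixed number and $k$ in (4) depends on $\epsilon$ and on $a$ through the relation

$$e^{-2ak}\, \Phi(a - k) - \Phi(-a - k) = \epsilon\, (1 + e^{-2ak}), \tag{7}$$

then $T^0$ minimizes the supremum of the inaccuracy function,

$$\sup_{F \in \mathcal{F},\ \beta \in R^1} \max[P(T < \beta - a),\ P(T > \beta + a)] \tag{8}$$

over all estimators $T$ of $\beta$; $\mathcal{F}$ is the set of distributions satisfying (2).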
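
Relation (7) determines $k$ only implicitly. A minimal sketch of its numerical solution follows, assuming (7) reads $e^{-2ak}\Phi(a-k) - \Phi(-a-k) = \epsilon(1 + e^{-2ak})$ and that $\epsilon < \Phi(a) - 1/2$, so that the difference of the two sides is positive at $k = 0$ and negative for large $k$; the function names are hypothetical.

```python
import math


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def relation_7_residual(k, eps, a):
    """Left-hand side minus right-hand side of relation (7)."""
    w = math.exp(-2.0 * a * k)
    return w * norm_cdf(a - k) - norm_cdf(-a - k) - eps * (1.0 + w)


def minimax_k(eps, a, tol=1e-12):
    """Solve relation (7) for k > 0 by bisection.

    Requires eps < Phi(a) - 1/2; otherwise the residual is not positive
    at k = 0 and no positive root is bracketed.
    """
    if relation_7_residual(0.0, eps, a) <= 0:
        raise ValueError("eps is too large for this a: need eps < Phi(a) - 1/2")
    lo, hi = 0.0, 1.0
    while relation_7_residual(hi, eps, a) > 0:   # expand until sign change
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if relation_7_residual(mid, eps, a) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for eps in (0.01, 0.05, 0.10):
    print(eps, minimax_k(eps, a=0.5))
```

Larger values of $\epsilon$ lead to smaller $k$, i.e. to earlier clipping of the residuals in (4).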


Remark 2.3.1 One may ask why the inaccuracy function (8), rather than, e.g.,
the variance, is taken as a measure of performance of the estimator. In the
finite-sample case the variance is not an adequate measure for robust
estimators: a long-tailed distribution of the observations may lead to an infinite
variance of proper estimators.

For instance, the variance of the sample mean is infinite for the distribution 


1 1h 
with the density f(z) = 1 if |¢| < 5 and f(x) = Upy ae the va- 


32 |ax\§ 
riance of the sample mean does not exist for any distribution the variance 
of which does not exist, such as the Cauchy distribution. 

The minimax property of $T^0$ is expressed in the following theorem:


Theorem 2.3.1 (Huber, 1968) Let $T^0$ be defined by (5) and (6) with the function
$\psi$ satisfying (4) and (7). Then

(i) $T^0$ is translation equivariant.

(ii) If $y_1, \ldots, y_n$ are independent identically distributed random variables such
that the distribution of $y_1 - \beta$ belongs to the system of distributions defined by (2),
then $T^0$ minimizes (8) over all estimators of $\beta$.


Proof. The idea is the following: one first constructs a minimax test of $\beta - a$
against $\beta + a$ and then derives an estimate from this test in the manner
of Hodges and Lehmann (1963); it coincides with $T^0$.

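Let $f_0$ denote the density of the standard normal distribution; let $P_0$ and
$P_1$ be the probability distributions defined by the respective densities

$$p_0(x) = f_0(x + a), \qquad p_1(x) = f_0(x - a). \tag{9}$$

The likelihood ratio $p_1(x)/p_0(x) = e^{2ax}$ is strictly monotone increasing. Introduce
the following families of probability distributions:

$$\begin{aligned} \mathcal{P}_0 &= \{ Q \in \mathcal{F} \mid Q\{(-\infty, t)\} \ge P_0\{(-\infty, t)\} - \epsilon \ \text{ for all } t \in R^1 \}, \\ \mathcal{P}_1 &= \{ Q \in \mathcal{F} \mid Q\{(t, \infty)\} \ge P_1\{(t, \infty)\} - \epsilon \ \text{ for all } t \in R^1 \}. \end{aligned} \tag{10}$$

Suppose that $\mathcal{P}_0 \cap \mathcal{P}_1 = \emptyset$ (this is the case for sufficiently small $\epsilon$). We shall
construct the minimax test $\xi$ of $\mathcal{P}_0$ against $\mathcal{P}_1$, i.e. the test which minimizes

$$\max\Bigl[ \sup_{Q_0' \in \mathcal{P}_0} E_{Q_0'}(\xi),\ \sup_{Q_1' \in \mathcal{P}_1} E_{Q_1'}(1 - \xi) \Bigr]. \tag{11}$$

These minimax tests happen to have a simple structure in our case. We shall
show that there is a ‘least favourable’ pair $Q_0 \in \mathcal{P}_0$, $Q_1 \in \mathcal{P}_1$ such that, for every
probability ratio test $\xi$ of $Q_0$ against $Q_1$,

$$\begin{aligned} E_{Q_0'}(\xi) &\le E_{Q_0}(\xi) \quad \text{for all } Q_0' \in \mathcal{P}_0, \\ E_{Q_1'}(\xi) &\ge E_{Q_1}(\xi) \quad \text{for all } Q_1' \in \mathcal{P}_1, \end{aligned} \tag{12}$$

i.e. which satisfies the assumptions of [A 3.4].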
We shall show that one version of $Q_0$ and $Q_1$ is given by the densities:

$$q_0(x) = \begin{cases} (1 + e^{-2ak})^{-1}\,[p_0(x) + p_1(x)] & x < -k \\ p_0(x) & |x| \le k \\ (1 + e^{2ak})^{-1}\,[p_0(x) + p_1(x)] & x > k \end{cases} \tag{13}$$

and

$$q_1(x) = \begin{cases} (1 + e^{2ak})^{-1}\,[p_0(x) + p_1(x)] & x < -k \\ p_1(x) & |x| \le k \\ (1 + e^{-2ak})^{-1}\,[p_0(x) + p_1(x)] & x > k \end{cases} \tag{14}$$

where $k$ satisfies (7). Hence the probability ratio of $Q_0$ and $Q_1$ satisfies

$$\log \frac{q_1(x)}{q_0(x)} = 2a\,\psi(x) \tag{15}$$

and the corresponding probability ratio test between $Q_0$ and $Q_1$ is of the form

$$\xi(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=1}^n \psi(x_i) > K \\ \kappa & \text{if } \sum_{i=1}^n \psi(x_i) = K \\ 0 & \text{if } \sum_{i=1}^n \psi(x_i) < K \end{cases} \tag{16}$$

The constants $K$ and $\kappa$ are adjusted so that

$$E_{Q_0}\,\xi = E_{Q_1}(1 - \xi) = \alpha, \tag{17}$$

where $\alpha$ is the minimax risk. The symmetry of the case implies that $K = 0$,
$\kappa = \frac{1}{2}$.
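The structure of the least favourable pair can be checked numerically. The following is a minimal sketch, assuming that $\psi$ in (4) is the clipping function $\psi(x) = \max(-k, \min(k, x))$ and that $f_0$ is the standard normal density; the function names are illustrative and not taken from the text. It evaluates $q_0$, $q_1$ from (13) and (14), confirms the identity (15), and implements the test (16) with $K = 0$, $\kappa = 1/2$.

```python
import math

def psi(x, k):
    """Clipping function assumed for (4): identity on [-k, k], constant outside."""
    return max(-k, min(k, x))

def normal_pdf(x):
    """Standard normal density f_0."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def least_favourable_densities(x, a, k):
    """Return (q0(x), q1(x)) as in (13) and (14)."""
    p0 = normal_pdf(x + a)
    p1 = normal_pdf(x - a)
    c_minus = 1.0 / (1.0 + math.exp(-2.0 * a * k))  # (1 + e^{-2ak})^{-1}
    c_plus = 1.0 / (1.0 + math.exp(2.0 * a * k))    # (1 + e^{+2ak})^{-1}
    if x < -k:
        return c_minus * (p0 + p1), c_plus * (p0 + p1)
    if x > k:
        return c_plus * (p0 + p1), c_minus * (p0 + p1)
    return p0, p1

def minimax_test(sample, k):
    """Test (16) with K = 0 and kappa = 1/2: returns the rejection probability."""
    s = sum(psi(x, k) for x in sample)
    if s > 0:
        return 1.0
    if s < 0:
        return 0.0
    return 0.5
```

For example, `minimax_test([3.0, -0.5, 0.2], 1.0)` rejects with probability one, because the large observation is clipped to 1 before summation and the clipped sum is still positive.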


It remains to show that the test is really minimax, i.e. that it satisfies (12).
We shall first verify that

$$Q_0'\Big\{\frac{q_1(x)}{q_0(x)} < t\Big\} \ge Q_0\Big\{\frac{q_1(x)}{q_0(x)} < t\Big\}, \qquad Q_1'\Big\{\frac{q_1(x)}{q_0(x)} < t\Big\} \le Q_1\Big\{\frac{q_1(x)}{q_0(x)} < t\Big\} \tag{18}$$

for all $Q_j' \in \mathcal{P}_j$, $j = 0, 1$. But (18) is trivially true for $t \le e^{-2ak}$ and $t > e^{2ak}$;
for $e^{-2ak} < t \le e^{2ak}$ the result follows from (10).

Suppose that the distribution of $x_i$ belongs to $\mathcal{P}_0$, $i = 1, \dots, n$; then (18)
implies that

$$\prod_{i=1}^n \frac{q_1(x_i)}{q_0(x_i)}$$

is stochastically largest when the $x_i$ are identically distributed according
to $Q_0$. Analogously, if the distribution of $x_i$ belongs to $\mathcal{P}_1$, $i = 1, \dots, n$, then

$$\prod_{i=1}^n \frac{q_1(x_i)}{q_0(x_i)}$$

is stochastically smallest when the $x_i$ are identically distributed according
to $Q_1$. This further implies (12), so that the likelihood ratio test of $Q_0$ against
$Q_1$ is really minimax for the problem.

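Now, for any $P$,

$$P(T^0 > \beta) = \tfrac{1}{2} P(T^* > \beta) + \tfrac{1}{2} P(T^{**} > \beta)
\le \tfrac{1}{2} P\Big\{\sum_{i=1}^n \psi(y_i - \beta) > 0\Big\} + \tfrac{1}{2} P\Big\{\sum_{i=1}^n \psi(y_i - \beta) \ge 0\Big\}
= E_P\,\xi(y_1 - \beta, \dots, y_n - \beta)$$

and similarly

$$P(T^0 < \beta) \le E_P[1 - \xi(y_1 - \beta, \dots, y_n - \beta)].$$

In particular, for $Q_j' \in \mathcal{P}_j$, $j = 0, 1$,

$$Q_0'(T^0 > 0) \le E_{Q_0'}\,\xi(y) \le \alpha, \qquad Q_1'(T^0 < 0) \le E_{Q_1'}[1 - \xi(y)] \le \alpha. \tag{19}$$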


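If the distribution of $y_i - \beta$ belongs to $\mathcal{F}$, then that of $y_i - \beta - a$ and of
$y_i - \beta + a$ belong to $\mathcal{P}_0$ and $\mathcal{P}_1$, respectively. On the other hand, $T^*$, $T^{**}$,
and $T^0$ are translation equivariant:

$T(u_1 + \theta, \dots, u_n + \theta) = T(u_1, \dots, u_n) + \theta$, $\theta \in \mathbb{R}^1$. This implies that for any
such distribution $P$ and $T = T^0$,

$$P(T(y) < \beta - a) = P(T(y - (\beta - a)1_n) < 0) = Q_1'(T(y) < 0) \le \alpha \tag{20}$$

and similarly,

$$P(T(y) > \beta + a) \le \alpha. \tag{21}$$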


Let $T$ be any translation-equivariant estimator. Then its distribution function
is continuous under $Q_0$ as well as under $Q_1$. Indeed,

$$T(x_1, \dots, x_n) = x_1 + T(0, x_2 - x_1, \dots, x_n - x_1),$$

so that

$$T(x_1, \dots, x_n) = t \quad \text{if and only if} \quad x_1 = t - T(0, x_2 - x_1, \dots, x_n - x_1).$$

Consequently, given $(x_2 - x_1, \dots, x_n - x_1) = (y_2, \dots, y_n)$, there is exactly
one point $(x_1, \dots, x_n)$ for which $T(x_1, \dots, x_n) = t$; namely, $x_1 = t - T(0, y_2, \dots, y_n)$
and $x_i = y_i + x_1$, $i = 2, \dots, n$. This implies (noting the fact that $Q_0$ and
$Q_1$ are absolutely continuous) that

$$Q_j\{T(X_1, \dots, X_n) = t \mid X_2 - X_1 = y_2, \dots, X_n - X_1 = y_n\} = 0$$

for every $(y_2, \dots, y_n)$ and every $t$ ($j = 0, 1$); hence

$$Q_j(T(X_1, \dots, X_n) = t) = 0 \quad \text{for every } t \ (j = 0, 1),$$

which was to be proved. In particular, we have $Q_j(T = 0) = 0$, $j = 0, 1$.
The estimator $T$ can be used as a test statistic for testing $\mathcal{P}_0$ against $\mathcal{P}_1$,
rejecting $\mathcal{P}_0$ if $T > 0$. Then

$$\sup_{Q_0' \in \mathcal{P}_0,\ Q_1' \in \mathcal{P}_1} \max\,[Q_0'(T > 0),\ Q_1'(T < 0)] \ge \alpha \tag{22}$$

because $\alpha$ is the minimax risk for testing $\mathcal{P}_0$ against $\mathcal{P}_1$. Hence, no translation-equivariant
estimator $T$ could be better than $T^0$. In connection with the Hunt-Stein
theorem for estimators ([A 3.6]), this implies that $T^0$ is minimax among
all estimators of $\beta$.


Remark 2.3.2 The author does not know whether the finite-sample mini- 
max property extends to the regression model. The problem is also that of an 
appropriate measure of performance of the estimators, analogous to (2.2.8). 


2.3.2 Alternative choice of the ψ-function


As we have seen, the $\psi$ defined in (4) leads to the estimator of location with
the finite-sample minimax property over a neighbourhood of the normal distribution.
We will see in subsection 2.3.4.2 that this function provides an estimator
of the regression parameter vector which is asymptotically minimax
over the family of $\varepsilon$-contaminated normal distributions.

Let us mention some other possible $\psi$-functions which may be appropriate
in different situations.


(a) $\psi(x) = x$, $x \in \mathbb{R}^1$. The corresponding estimator is the least squares estimator.
If we want to limit the influence of gross observational errors, then
it is intuitively clear that $\psi$ should be a bounded function.

(b) $\psi(x) = \operatorname{sign} x$, $x \in \mathbb{R}^1$. The corresponding estimator is an extension of the
sample median to the regression model; it is the sample median in the
location case.

It has been argued that the influence of extremely discordant observations
should be reduced to zero; this means that one should choose a $\psi(x)$ which
vanishes for large $|x|$.

(c) $$\psi(x) = \begin{cases} x & \text{if } |x| \le k, \\ 0 & \text{if } |x| > k. \end{cases}$$

(d) If we solve the asymptotic minimax problem for the $\varepsilon$-contaminated normal
distributions with restriction to the functions which vanish for $|x| > c$,
we obtain the estimator corresponding to the following function:

$$\psi(x) = \begin{cases} x & \text{if } 0 \le x \le k \\ b \tanh\big[\tfrac{1}{2} b (c - x)\big] & \text{if } k \le x \le c \\ 0 & \text{if } x \ge c \\ -\psi(-x) & \text{if } x < 0 \end{cases}$$

with $b$, $k$ depending on $\varepsilon$ (see Huber, 1969).


Other functions vanishing outside an interval:

(e) $$\psi(x) = \begin{cases} x & \text{if } |x| \le a \\ a \operatorname{sign} x & \text{if } a < |x| \le b \\ a\,\dfrac{c - |x|}{c - b} \operatorname{sign} x & \text{if } b < |x| \le c \\ 0 & \text{if } c < |x| \end{cases}$$
($0 < a < b < c$).

(f) $$\psi(x) = \begin{cases} x & \text{if } |x| \le a \\ a\,\dfrac{b - |x|}{b - a} \operatorname{sign} x & \text{if } a < |x| \le b \\ 0 & \text{if } b < |x| \end{cases}$$
($0 < a < b$).

(g) $$\psi(x) = \begin{cases} \sin \dfrac{x}{2a} & \text{if } |x| \le 2a\pi \\ 0 & \text{if } 2a\pi < |x|. \end{cases}$$


Despite some advantages, the nonmonotone $\psi$-functions should be used
with extreme caution: since the corresponding $\rho$ is nonconvex, the iterative
determination of the minimum in (2.2.8) may easily lead to a local minimum
far away from the true minimum.
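The warning about nonmonotone $\psi$-functions can be made concrete in the location case. The sketch below is illustrative only: it assumes the clipping form of (4), the 'skipped' function of type (c), and the usual three-part piecewise-linear form for (e); the function names are hypothetical. With a redescending $\psi$, the estimating equation $\sum_i \psi(y_i - t) = 0$ can have several roots, one near each cluster of observations.

```python
import math

def psi_huber(x, k):
    """Monotone clipping function, assumed form of (4)."""
    return max(-k, min(k, x))

def psi_skipped(x, k):
    """Function of type (c): identity on [-k, k], zero outside."""
    return x if abs(x) <= k else 0.0

def psi_three_part(x, a, b, c):
    """Piecewise-linear redescending function of type (e), 0 < a < b < c."""
    ax = abs(x)
    sign = math.copysign(1.0, x)
    if ax <= a:
        return x
    if ax <= b:
        return a * sign
    if ax <= c:
        return a * (c - ax) / (c - b) * sign
    return 0.0

def location_score(t, data, psi):
    """Left-hand side of the estimating equation sum_i psi(y_i - t) = 0."""
    return sum(psi(y - t) for y in data)
```

With `data = [0.0, 0.0, 0.0, 10.0, 10.0]` and the skipped function with `k = 2`, the score vanishes both at `t = 0` and at `t = 10`, so an iterative solver started near the minority cluster can stop there. With the monotone `psi_huber` the score is nonincreasing in `t`, so this ambiguity does not arise.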


2.3.3 Computational aspects and numerical algorithms


Usually the minimization (2.2.8) does not provide scale-equivariant estimators.
Hence, estimating the scale parameter simultaneously with the regression
parameters has been suggested, where the function to be minimized is of the form

$$\sum_{i=1}^n \rho\Big(\frac{1}{\sigma}\Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)\Big) = \min! \tag{23}$$

and the minimum has to be found under the constraint

$$\sum_{i=1}^n \chi\Big(\frac{1}{\sigma}\Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)\Big) = \frac{1}{2}(n - p)\gamma \tag{24}$$

with

$$\gamma = 2E_\Phi\,\chi(y) \tag{25}$$

(the expectation is with respect to the normal distribution; (24) and (25)
guarantee that the estimator of $\sigma$ is asymptotically unbiased for normal
errors), and

$$\chi(x) = x\psi(x) - \rho(x), \qquad \psi(x) = \rho'(x). \tag{26}$$

The minimization (23) under (24) may be proved to be equivalent to the
minimization of

$$g(\beta, \sigma) = \sum_{i=1}^n \rho\Big(\frac{1}{\sigma}\Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)\Big)\sigma + a\sigma \tag{27}$$

where

$$a = \frac{1}{2}(n - p)\gamma. \tag{28}$$

We will restrict ourselves to the function $\rho$ defined in (3). Despite its simple
form, the solution of (27) cannot be found by a straightforward calculation
but has to be computed iteratively.

The function $\psi$ thus has the form (4) for a given $k > 0$ and the function $g$ in
(27) is convex in $(\beta, \sigma)$. Hence, unless the minimum $(\hat\beta, \hat\sigma)$ occurs on the boundary
$\sigma = 0$, it can be equivalently characterized by the $(p + 1)$ equations


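$$\sum_{i=1}^n \psi\Big(\frac{\delta_i(\beta)}{\sigma}\Big) x_{ij} = 0, \qquad j = 1, \dots, p, \tag{29}$$

$$\sum_{i=1}^n \chi\Big(\frac{\delta_i(\beta)}{\sigma}\Big) = a \tag{30}$$

with $\delta_i(\beta) = y_i - \sum_{j=1}^p x_{ij}\beta_j$ and $a$ defined in (25)-(28).

We shall briefly review three algorithms for solving (29) and (30). All algorithms
are iterative and improve trial values $\beta^{(m)}$, $\sigma^{(m)}$ to $\beta^{(m+1)}$, $\sigma^{(m+1)}$, $m = 0,
1, 2, \dots$, stepwise.

We shall consider the model (2.2.1) and assume that the errors $\varepsilon_i$ have the
same variance $\sigma^2$, $i = 1, \dots, n$. Let $x_i$ denote the $i$th row of $X$; the rank of $X$
is assumed to be equal to $p$.

The following partition of the index set $I = \{1, \dots, n\}$ with respect to the
function $\psi$ and to the residuals $\delta_i = \delta_i(\beta)$, $i = 1, \dots, n$, will be used in the
sequel:

$$I_- = I_-(\beta, \sigma) = \{i : \delta_i < -k\sigma\}$$
$$I_0 = I_0(\beta, \sigma) = \{i : |\delta_i| \le k\sigma\}$$
$$I_+ = I_+(\beta, \sigma) = \{i : \delta_i > k\sigma\}$$

Let $C_0$ be the matrix such that the $i$th row of $C_0$ is equal to $x_i$ for $i \in I_0$, while
the other rows of $C_0$ consist of zeros. The gradient of $g$ with respect to $\beta$ for
fixed $\sigma$ can be written in the form

$$\nabla_\beta\, g = -\frac{1}{\sigma}\Big[C_0'y - C_0'C_0\beta + k\sigma\Big(\sum_{i \in I_+} x_i' - \sum_{i \in I_-} x_i'\Big)\Big].$$

The matrix of the second derivatives with respect to $\beta$ is then

$$\frac{\partial^2 g}{\partial\beta\,\partial\beta'} = \frac{1}{\sigma}\,C_0'C_0.$$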

We shall now give a description of three types of algorithms. 


Algorithm H (adaptation of the nonlinear least squares algorithm) 


We need the starting values $\beta^{(0)}$, $\sigma^{(0)}$ and a tolerance level $\varepsilon > 0$ (a small
number, e.g. a negative power of 10).


(1) Put $m = 0$.
(2) Compute the residuals $\delta_i^{(m)} = y_i - \sum_{j=1}^p x_{ij}\beta_j^{(m)}$, $i = 1, \dots, n$.

(3) Compute a new value for $\sigma$ by

$$(\sigma^{(m+1)})^2 = \frac{1}{2a}\,(\sigma^{(m)})^2 \sum_{i=1}^n \psi^2\Big(\frac{\delta_i^{(m)}}{\sigma^{(m)}}\Big).$$

(4) 'Winsorize' the residuals, i.e. compute

$$\Delta_i^{(m)} = \sigma^{(m+1)}\,\psi\Big(\frac{\delta_i^{(m)}}{\sigma^{(m+1)}}\Big), \qquad i = 1, \dots, n.$$

(5) Solve $X'X\tau^{(m)} = X'\Delta^{(m)}$ with respect to $\tau^{(m)}$.
(6) Put $\beta^{(m+1)} = \beta^{(m)} + q\tau^{(m)}$, where $0 < q < 2$ is an arbitrary relaxation
factor.
(7) Stop if the parameters change by less than $\varepsilon$ times their standard deviations,
i.e. if $|q\tau_j^{(m)}| < \varepsilon\,\sigma^{(m+1)}\sqrt{w_{jj}}$ for all $j = 1, \dots, p$, where $w_{jj}$ is the $j$th
diagonal element of the matrix $(X'X)^{-1}$, and if $|\sigma^{(m+1)} - \sigma^{(m)}| < \varepsilon\,\sigma^{(m+1)}$;
in that case go to (9).

(8) Otherwise put $m := m + 1$ and go to (2).
(9) Estimate $\beta$ by $\beta^{(m+1)}$ and $\sigma$ by $\sigma^{(m+1)}$.


The relaxation factor $q$ in step (6) will be chosen as

$$q = [E_\Phi\,\psi'(y)]^{-1} = [\Phi(k) - \Phi(-k)]^{-1}$$

provided $0 < q < 2$ (if $[\Phi(k) - \Phi(-k)]^{-1} > 1.9$, set $q = 1.9$), where $\Phi$ is the
standard normal distribution function. The convergence of the algorithm is
proved in Huber (1977).
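The steps of Algorithm H can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the program of Huber and Dutter: $\psi$ is taken as the clipping function of (4), so that $\chi = \psi^2/2$ and $2a = (n - p)E_\Phi\psi^2$; the stopping rule of step (7) is simplified by dropping the factor $\sqrt{w_{jj}}$; the data are assumed not to be fitted exactly (so that $\sigma$ stays positive); and the helper names (`solve`, `algorithm_h`) are hypothetical.

```python
import math

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def psi(x, k):
    """Clipping function assumed for (4)."""
    return max(-k, min(k, x))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def algorithm_h(X, y, beta, sigma, k=1.5, eps=1e-6, max_iter=500):
    n, p = len(X), len(X[0])
    # E_Phi psi^2 for the clipping function, hence 2a = (n - p) * E_Phi psi^2.
    e_psi2 = (2.0 * Phi(k) - 1.0) - 2.0 * k * phi(k) + 2.0 * k * k * (1.0 - Phi(k))
    two_a = (n - p) * e_psi2
    q = min(1.0 / (Phi(k) - Phi(-k)), 1.9)          # relaxation factor
    XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
           for r in range(p)]
    beta = list(beta)
    for _ in range(max_iter):
        # Step (2): residuals.
        res = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Step (3): scale update.
        new_sigma = math.sqrt(sigma ** 2 * sum(psi(r / sigma, k) ** 2 for r in res) / two_a)
        # Step (4): Winsorized residuals.
        wins = [new_sigma * psi(r / new_sigma, k) for r in res]
        # Step (5): least squares step on the Winsorized residuals.
        tau = solve(XtX, [sum(X[i][j] * wins[i] for i in range(n)) for j in range(p)])
        # Step (6): relaxed update of beta.
        beta = [beta[j] + q * tau[j] for j in range(p)]
        # Step (7), simplified: relative change below eps (no sqrt(w_jj) scaling).
        done = (max(abs(q * t) for t in tau) < eps * new_sigma
                and abs(new_sigma - sigma) < eps * new_sigma)
        sigma = new_sigma
        if done:
            break
    return beta, sigma
```

At convergence the returned pair should satisfy the estimating equations (29) approximately, which gives a convenient check that does not depend on knowing the answer in advance.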


Algorithm W (adaptation of the nonlinear weighted least squares algorithm)

Again we start with values $\beta^{(0)}$, $\sigma^{(0)}$ and a tolerance level $\varepsilon > 0$. This algorithm
uses a weighted least squares technique and its iteration steps are the same as
those of Algorithm H except for the steps (4)-(5), which are as follows:


(4) Calculate the 'weights':

$$p_i^{(m)} = \psi\Big(\frac{\delta_i^{(m)}}{\sigma^{(m+1)}}\Big) \Big/ \frac{\delta_i^{(m)}}{\sigma^{(m+1)}} \quad \text{if } \delta_i^{(m)} \ne 0,$$

$$p_i^{(m)} = 1 \quad \text{if } \delta_i^{(m)} = 0,$$

$i = 1, \dots, n$, and define a diagonal matrix $P^{(m)}$ with $p_i^{(m)}$ as its $i$th diagonal element.
(5) Solve
$$X'P^{(m)}X(\beta^{(m)} + \tau^{(m)}) = X'P^{(m)}y$$
with respect to $\tau^{(m)}$.
The relaxation factor $q$ in step (6) will be put equal to 1. The convergence
was proved by Dutter (1975a).
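The weighting step is easiest to see in the location case, where $X$ is a column of ones and step (5) reduces to a weighted mean. The following sketch is a simplification and rests on assumptions not made in the text: the scale $\sigma$ is held fixed (the scale step (3) is omitted), $\psi$ is the clipping function of (4), and the function names are illustrative.

```python
def psi(x, k):
    """Clipping function assumed for (4)."""
    return max(-k, min(k, x))

def weight(residual, sigma, k):
    """Weight psi(u)/u with u = residual/sigma; equal to 1 for a zero residual."""
    if residual == 0.0:
        return 1.0
    u = residual / sigma
    return psi(u, k) / u

def irls_location(y, sigma, k=1.5, start=None, tol=1e-12, max_iter=200):
    """Iteratively reweighted mean: location case of steps (4)-(5) with q = 1."""
    t = sorted(y)[len(y) // 2] if start is None else start
    for _ in range(max_iter):
        w = [weight(yi - t, sigma, k) for yi in y]
        new_t = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        if abs(new_t - t) < tol * sigma:
            return new_t
        t = new_t
    return t
```

For the sample `[0.1, -0.2, 0.3, 0.0, -0.1, 10.0]` with `sigma = 1` and `k = 1.5`, the five small observations keep weight 1 while the outlier is downweighted, and the fixed point solves $\sum_i \psi(y_i - t) = 0$, i.e. $0.1 - 5t + 1.5 = 0$.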
Algorithm S 


Besides starting values f° and o° and a tolerance « > 0, we need an estimate x 
of the ‘downhill’ property of this algorithm (see Dutter, 1975a), which may 
be approximated by x ~ w,/w,, where w, and w, are the smallest and the 
greatest eigenvalue of X’X, respectively. Further, we need an upper bound
k, for the squared norm of the Hessian matrix H = ts O5Co, which is computed
by ee 


a Dp 1/2 
6/4, => (2-4) : 
j=l 
The description of the algorithm is then the following: 
(1) 
(2) see Algorithm H 
(3) 


(4) Find the partition (J_, Jy, J,) with respect to (B™, o™). 
(5) Compute the vector 


wm — up, om) aa (O50) [Coy ae ko™*)5] ade pm 


with 
a= ay, — DK. 
46d i 
(6) Tf jw| < ecm Ve, for all j =1,...,p and |o™) — otm| < egtm+h, 


estimate B by B™ + w™ and o by o™” and stop. 
(7) If fh) = p™ + w™ and o™ yield the same partition as in (4), put 


pm) — fm) and go to (8); 
otherwise compute 
ford — Bom + my 


where the relaxation factor y™ is chosen according to the following in- 
struction: 


Define 
aly) = [glB™ + yore, ofm*D) — g(B™, of™)] 


J [vy (w™ J -Vg(p™, gmt) )}-1 


with 


yl, of) = [Chy + homs — fC yp] — 
oC ) 


1 


‘ 1 
and choose 7™ as the largest element in the sequence 1, pete Eas 


for which «(y™) > C, where0 << (<1. 


Go to (9). 
(8) If $|\sigma^{(m+1)} - \sigma^{(m)}| < \varepsilon\,\sigma^{(m+1)}$, estimate $\beta$ by $\beta^{(m+1)}$ and $\sigma$ by $\sigma^{(m+1)}$, and stop
the procedure.
(9) Compute o(™*?) — om with 


2 


= Ziyi —# — %; (CoCo) * Coy, 
br = BE [ol (C600) (Be, — Da.) + # (E14 D1) — lo —a)y 
i€T g JET 4, jed- ied 5 ie T 


and 
Gers) = (C009) [Cod + hotm*2)s]. 


If the partition (J_, Jo, J,) has not changed, take (B("*», o(™*®) as an 
estimator of (6, «) and stop. 
(10) Put m := m + 1 and go to (2). 


The computation of the matrix $(C_0'C_0)^{-1}$ may cause some difficulties because
$C_0'C_0$ may be singular, as discussed in Dutter (1975a).

Algorithm S is preferable if high accuracy is needed; Algorithms H and W
are much simpler to code, and a single iteration is simpler and faster to
perform. The algorithms and their variants are described in detail in Huber
and Dutter (1974) and Dutter (1975a, b); the latter work provides a list of the
most important programs and some programs in the form of subroutines.

Dutter (1975b) compares the algorithms and their variants from the point
of view of computation time and number of iterations.


2.3.4 Asymptotic properties of M-estimators 


The asymptotic theory in this section considers the case that $n \to \infty$ and $p$ is
fixed. For some results concerning the case where the number $p$ of parameters is
allowed to increase with the number $n$ of observations, we refer to Huber
(1973).

The asymptotic normality of M-estimators of regression parameters has been
proved under various regularity conditions. We shall prove one of the results
in subsection 2.3.4.1.

Subsection 2.3.4.2 will be devoted to the asymptotic minimax property of
M-estimators (and simultaneously that of the related L- and R-estimators) in a
neighbourhood of a given unimodal distribution, e.g. the normal distribution.


2.3.4.1 Asymptotic normality of M-estimators 


For n = 1, 2,..., let us consider the model (2.2.1) under the following system 
of assumptions: 


(A1) $f(x) = F'(x)$ exists, is absolutely continuous, and $F$ has finite Fisher
information, i.e.

$$I(F) = \int \Big(\frac{f'(x)}{f(x)}\Big)^2 dF(x) < \infty. \tag{31}$$

Put

$$F^{-1}(t) = \inf\{x : F(x) \ge t\}, \qquad 0 < t < 1. \tag{32}$$


(A2) X, = (oP i up 8 & given (nXp) matrix with the columns z ,, 
j=1,...,p and the rows 4,1 =1,...,0 satisfying the following con- 
ditions (omitting the superscript n): 

0) y=tj+df 1sisn, isjsp 


a 


(b) The vectors 2 = (Hy,...,%4;/, j=1,-.., p, satisty 

why A=eM j=1,..,9, (33) 
where the salar product in (33) is either 0 for al) but a finite number of 
n, positive for a\\ but a finite number of n; if it is positive, then 


lira max (xf)? [z a =0  (Noether’s condition); (34) 
cad | 


twits Amie 


M > 0 is 2 constant independent of n. 
Analogous conditions are to be satisfied by the vectors 27,7 = 1,...,p. 


(c) All pairs j,h = 1,...,p andi,k =1,...,n satisfy 
(4 — th) (hy, — Hes) 20 
(4 — Hy)  — 2g) <0 (35) 
(it — “f) (ot — aif) 20. 
(d) $\lim_{n \to \infty} W_n = W = (w_{jk})_{j,k=1}^p$ exists and is a positive definite matrix,
where

$$W_n = \frac{1}{n}\,X_n'X_n.$$

(A3) Let $\psi(x)$, $x \in \mathbb{R}^1$, be a nonconstant nondecreasing function such that

$$\int_{\mathbb{R}^1} \psi^2(x)\,dF(x) < \infty.$$


Remark 2.3.3. The assumption (35) of concordance and discordance of the 
vectors £4 a L4,j =1,.---,p, mneans a restriction for the design matrix X,. 
However, it is satisfied on many models, eg., for the polynomial regression 
with 

jas 


Lig = 0, — 


8 


154 Chapter 2. Robust statistical inference in linear models 


In some situations, the validity of the assumption can be achieved by an appro- 
priate design of experiments. 
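Let us denote

$$\omega = -\int_{\mathbb{R}^1} \psi(x) f'(x)\,dx \tag{36}$$

and

$$\tau^2 = \int_{\mathbb{R}^1} \psi^2(x)\,dF(x) - \Big(\int_{\mathbb{R}^1} \psi(x)\,dF(x)\Big)^2 = \operatorname{var}_F \psi(X). \tag{37}$$

Let $\hat\beta^{(n)}$ be the M-estimator of $\beta$ generated by the function $\psi$, i.e. the solution
of (2.2.9). Then we have the following result.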


Theorem 2.3.2 Suppose that the assumptions (A1), (A2), (A3) are satisfied
for $n = 1, 2, \dots$. Then $n^{1/2}(\hat\beta^{(n)} - \beta)$ has an asymptotic $p$-dimensional normal
distribution

$$N_p(0,\ (\tau^2/\omega^2)\,W^{-1}). \tag{38}$$
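The scalar factor $\tau^2/\omega^2$ in (38) is easy to evaluate in a concrete case. The sketch below assumes that $F$ is the standard normal distribution and that $\psi$ is the clipping function $\psi(x) = \max(-k, \min(k, x))$; under these assumptions integration by parts in (36) gives $\omega = \int \psi'\,dF = 2\Phi(k) - 1$, and (37) gives $\tau^2 = E_\Phi\psi^2$ because $\psi$ is odd. The function names are illustrative.

```python
import math

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def omega(k):
    """omega of (36) for the clipping psi under the standard normal F."""
    return 2.0 * Phi(k) - 1.0

def tau2(k):
    """tau^2 of (37): E min(X^2, k^2) for standard normal X."""
    return (2.0 * Phi(k) - 1.0) - 2.0 * k * phi(k) + 2.0 * k * k * (1.0 - Phi(k))

def variance_factor(k):
    """Factor tau^2 / omega^2 multiplying W^{-1} in (38)."""
    return tau2(k) / omega(k) ** 2
```

Since the least squares estimator has factor 1 under normal errors, `1 / variance_factor(k)` is the asymptotic efficiency of the M-estimator at the normal model; it approaches 1 as `k` grows and decreases as `k` shrinks.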


Proof. Denote by

$$M_j(\beta) = \sum_{i=1}^n x_{ij}\,\psi\Big(y_i - \sum_{k=1}^p x_{ik}\beta_k\Big), \qquad j = 1, \dots, p,$$

the right-hand side of the $j$th equation of (2.2.9). We shall approximate $M_j(\beta)$
by a linear function of $\beta$ in the sense of convergence in probability. This will
be done in the following theorem.


Theorem 2.3.3 Under the assumptions (A1), (A2), (A3),

$$\max_{\{\beta\,:\ n^{1/2}\|\beta - \beta^0\| \le K\}} n^{-1/2}\,\big|M_j(\beta) - M_j(\beta^0) + n\omega(\beta - \beta^0)'w_j\big| \xrightarrow{P_{\beta^0}} 0 \tag{39}$$

as $n \to \infty$,
for any $K > 0$ and any fixed $\beta^0 \in \mathbb{R}^p$; $w_j$ is the $j$th column of $W$.


Proof. We may suppose without loss of generality that $\beta^0 = 0$. We shall
first prove (39) for any fixed sequence $\{\beta^{(n)}\}_{n=1}^\infty$ such that $n^{1/2}\beta^{(n)} = \Delta \in \mathbb{R}^p$
for $n \ge n_0$, $\|\Delta\| \le K$. For convenience, denote

$$d_{ij} = n^{-1/2}x_{ij} \tag{40}$$

and

$$N_j(\Delta) = \sum_{i=1}^n d_{ij}\,\psi\Big(y_i - \sum_{k=1}^p d_{ik}\Delta_k\Big) = n^{-1/2}M_j(n^{-1/2}\Delta), \qquad j = 1, \dots, p. \tag{41}$$

For a fixed $h$, $1 \le h \le p$, let $A_h = \{\Delta : \Delta_k = 0 \text{ for } k \ne h,\ k = 1, \dots, p\}$. Let
us fix $j$, $1 \le j \le p$. We shall first prove a lemma.

Lemma 2.3.1 If $\Delta \in A_h$, then $N_j(\Delta)$ has an asymptotic normal distribution
$N(-\Delta_h\omega w_{hj},\ \tau^2 w_{jj})$, as $n \to \infty$; $h = 1, \dots, p$.

Proof of Lemma 2.3.1. For convenience, denote 
d= 4d), 0, = Cr ise ed Reid Pinna ta Bente 


Let us introduce the likelihood ratio 


On account of (A2) and of [A 2.5], the densities [] f(z; + d;) are contiguous 


n w=1 
with respect to the densities [| /(x;) (see also [A 2.2]). Moreover, [A 2.5] implies 
that —<- 


Py How ta — + > AIF) wm 


zo, as 2 —> Co (42) 


where 


1 
The central limit theorem implies that the pair (15%, T, — af Al(F) vn 
is, as n > oo, asymptotically jomtly normal with the parameters 
1 
Wy = 0, Me = —— AL(F) wrp 
: (43) 
o; = TW; 5 oR => Awl (F), O12 = Awwy;. 
It then follows from (41) that (Nj”(0), log Z,) has the same asymptotic di- 
stribution. The asymptotic normality of Nj”(A) then follows from the third 
LeCam lemma (see [A 2.4]). 


Lemma 2.3.2 Under the above assumptions,

$$\lim_{n \to \infty}\ \max_{1 \le j \le p} P_0\{|N_j(\Delta) - N_j(0) + \Delta_h\omega w_{hj}| \ge \varepsilon\} = 0 \tag{44}$$

for any $\varepsilon > 0$ and any fixed $\Delta \in A_h$.
Proof. Let us keep the notation of the preceding proof. Furthermore, denote 


f= wie) Vo bat. 
Then 


1 
lim if [é(t) — E(t)? dt = 0 


n—co 0 
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where 
é = if O<t<— 
m 
&(™)(¢) E(t) if pear 
m m 
‘cos if ees eet 
m m 
TOL Este Otecs 
y™t) = EF), te R 
and 
N™(A) = Ldiv\Yni — Adi), m = 2,8, .-2. 
Then 
n 1 
ELN}(0) — Nf-™O)P S Da? f (£ — EOP dt Se (45) 
s—1 0 


for m > my uniformly in n = 1, 2,.... 


(45), the contiguity of [| f(z; + 4d;) with respect to [] f(z;), and [A 2.6] 
then imply that Ee =i 


max |W{(4) — N{""™(A)| 240, as n> 00. (46) 
1SjSp 
Furthermore, 
Var [N™™(4) — N%™(0)] < Sa di [ [ye —Ad;) — y™(a)P 4F(e). 
2 (47) 


The last tegral tends to zero for m fixed and as n —> co according to Lebesgue’s 

dominated convergence theorem, for y™, being bounded and nondecreasing, 

has at most countably many discontinuities. On account of the Chebyshev 

inequality this implies that, given any « > 0 and any fixed m = 2,3,..., 
lim max P,{jN{"™(4) — N¥"™(0) — EN}*™(A)| =e} = 0. (48) 
n—>0o 1S5j75p 

According to Lemma 2.3.1, $N_j^{(m)}(\Delta)$ is asymptotically normal $N(-\omega^{(m)}\Delta_h w_{hj},\ \tau_m^2 w_{jj})$,
and $N_j^{(m)}(0)$ is asymptotically normal $N(0,\ \tau_m^2 w_{jj})$ with

$$\omega^{(m)} = -\int \psi^{(m)}(x) f'(x)\,dx$$

and

$$\tau_m^2 = \int (\psi^{(m)}(x))^2\,dF(x) - \Big(\int \psi^{(m)}(x)\,dF(x)\Big)^2;$$

thus, on account of (48) and of $\lim_{m \to \infty} \omega^{(m)} = \omega$, we get

$$\lim_{n \to \infty}\ \max_{1 \le j \le p} P_0\{|N_j^{(m)}(\Delta) - N_j^{(m)}(0) + \Delta_h\omega^{(m)} w_{hj}| \ge \varepsilon\} = 0. \tag{49}$$

The result then follows from (45), (46), (47), and (49).


Completion of the proof of Theorem 2.3.3 Let us introduce the statistics 
NG(A*, a) = Dai (y. — E att + aeeayy) 
E s (50) 
Np(A*, A**) = Sadie (ys — ¥ (att + astay)) 
and 
I AA* VAST ee NAT, AY teas (Ae, AS). (ai ae ye 


For a fixed j, let x';2,; > 0 for all but a finite number of n. Regarding (35), 
[A 3.8] implies that N¥ is nonincreasing in 4f,..., 45 and nondecreasing in 
At*,..«.¢A5 5 while N}* is nondecreasing in Af, ..., 4, and nonincreasing in 
Ay*, ..., A5* (see also an analogous proof of Theorem 2.1 of Jureckovd 1969). 
Lemma 2.3.2 and the contiguities mentioned above entail that for arbitrary 


fixed A*, A**, 


lim P, | N,(A*, 4**) — N;(0) 
+ oS [AMG a + amar ay] = ‘| 2 (51) 
h=1 


Let Q = {6, ..., 6}, where —K = 0° <...< 6% = K, be a partition of 
[—K, K] such that 


|o(d) — 6@-D)| < c(QpMV?)-1, k=1,...,7. (52) 


Denoting I = {x:||x|| < K}, we get from (52) and from the monotonicity of NF 
that 


p 
max |N*(A*, A**) — N#(0) + o > [ARG4) dh, + A (GY C4") 
A*,A**EL heart 
z * *\/ * KE) TR 
< max Nj (A*, A**) — NF(0) + @ DAG (a) a’, + An (dy) Te t+ €, 
A*, A**eD h=1 


where D = {4 = (A, ..., 4p)’; 4; € Q, & = 1, .--, B}- An analogous proposition 
is valid for N¥*. Thus, we get from (51), (52), and from the analogous inequa- 
lity for N7* that 


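$$\lim_{n \to \infty} P_0\Big\{\max_{\|\Delta\| \le K} \Big|N_j(\Delta) - N_j(0) + \omega\sum_{k=1}^p \Delta_k w_{kj}\Big| \ge \varepsilon\Big\} = 0, \qquad j = 1, \dots, p. \tag{53}$$

Finally, we have for each fixed $n \in \mathbb{N}$ and $j = 1, \dots, p$

$$\max_{n^{1/2}\|\beta\| \le K} n^{-1/2}\Big|M_j(\beta) - M_j(0) + n\omega\sum_{k=1}^p \beta_k w_{kj}\Big|$$
$$= \max_{n^{1/2}\|\beta\| \le K} \Big|\sum_{i=1}^n d_{ij}\Big[\psi\Big(y_i - \sum_{k=1}^p x_{ik}\beta_k\Big) - \psi(y_i)\Big] + \omega n^{1/2}\sum_{k=1}^p \beta_k w_{kj}\Big|$$
$$\le \max_{\|\Delta\| \le K} \Big|N_j(\Delta) - N_j(0) + \omega\sum_{k=1}^p \Delta_k w_{kj}\Big|, \tag{54}$$

which, in connection with (53), completes the proof of Theorem 2.3.3.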


Theorem 2.3.3 has an easy corollary:


Corollary 2.3.1 Let $\{\beta^{(n)}\}_{n=1}^\infty$ be a sequence of random vectors from $\mathbb{R}^p$ such
that $\{n^{1/2}(\beta^{(n)} - \beta^0)\}_{n=1}^\infty$ is bounded in probability. Then, under the assumptions
(A1), (A2), (A3),

$$n^{-1/2}\big[M_j(\beta^{(n)}) - M_j(\beta^0) + n\omega(\beta^{(n)} - \beta^0)'w_j\big] \xrightarrow{P_{\beta^0}} 0, \qquad j = 1, \dots, p, \tag{55}$$

as $n \to \infty$.


Completion of the proof of Theorem 2.3.2. The asymptotic distribution of
$n^{1/2}(\hat\beta^{(n)} - \beta^0)$ could be derived by means of the above corollary if we knew
that this sequence is bounded in probability. The following lemma shows that
this is the case.


Lemma 2.3.3 Under the assumptions (A1), (A2), (A3), to any $\varepsilon > 0$ there
correspond $K > 0$, $\eta > 0$ and $n_0 \in \mathbb{N}$ such that

$$P_{\beta^0}\Big\{\min_{n^{1/2}\|\beta - \beta^0\| \ge K} n^{-1/2}\|M(\beta)\| < \eta\Big\} < \varepsilon \tag{56}$$

for $n > n_0$, where
$M(\beta) = (M_1(\beta), \dots, M_p(\beta))'$.


Proof. Again we may put f° = 0, First, the sequence {n~/?M\"(0)} is bounded 
in probability for 7 = 1, ..., p, since it has a nondegenerate asymptotic normal 
distribution; thus, there exists an %) € IN and a Ky > 0 such that 


1 
Po{n-"2||M(0)|| > Ko} < ae for n> NM. (57) 
Let K, 7 be any pair of numbers satisfying 
| 1 
K > 2Ko/(Aoo); = > oe 


where $\lambda_0$ is the minimal eigenvalue of $W$. Then
Theorem 2.3.3 and (57) yield for $n > n_1$,


Po min 3 6,MIp) < ml Ly (58) 
n1/?||B\|=K j=1 


with 79 = Ky. The left-hand side of (58) is less than or equal to 


Pal min 3° A,M") <m, min Sf, [ro — nw Au 


mr|ip||=K j=1 nil"\B||=K j=1 


oo 2neh aS Po min x B; no) — nw 2 fa << 2reh 


ntl?|[B||=K j=1 


Pp Pp 
<P, | ex 338, sige ~ MY""8) — no ¥ pen > ml 
ni/?||p||=K-j7=1 | k=1 


n?|pB|=K j=1 


<P} max y nk 
*\\B||=K j=1 


M\(B) — Ms”(0) + nw > BW; 
k=1 


“| 


+ Po{—Kn-¥2||M(0)|| < 2 — K2Aqw} > 0 aS n> OO. 


Let p* ie a point with n1/2|6*|| = K. Put 2f = => “ub, t = 1,...,0; then 
j= 
2's) ciy(Yni — T2}) iS nonincreasing in T, so hat for yre= its 
t=1 
P p 
Y (—BF) Mi(76*) = —M(r) = —M(1) = & (—6F) Mj"(6*). (59) 
j=1 


j=1 


Now, if n1/2||6|| > K, then £ = tf*, where B* = n~/?K6/||A|| so that nil2|1B*|| 
= K andr = n¥/2|8|\/K > 1. (58) and (59) then imply that 


Pof min n-¥)MVp)|| < o} 


ntl?||B\|= K 


< Po min > (—f,) M(B) mK IB < nk 


miip||SK j=1 


AN 


Re min y (—B*) M}”(B*) = nl Ze for nS ne. 


ml|B*||=K j=1 
Finally, since $M_j(\hat\beta^{(n)}) = 0$, $j = 1, \dots, p$, we get from Lemma
2.3.3 and from Corollary 2.3.1 that

$$n^{1/2}(\hat\beta^{(n)} - \beta^0) - \frac{1}{\omega}\,n^{-1/2}\,W^{-1}M(\beta^0) \xrightarrow{P} 0 \tag{60}$$

under $\beta^0$, so that $n^{1/2}(\hat\beta^{(n)} - \beta^0)$ has the same asymptotic distribution as
$\frac{1}{\omega}\,n^{-1/2}\,W^{-1}M(\beta^0)$. The asymptotic distribution of the latter sequence
is easily found to be that given in (38), because each component of $M(\beta^0)$
is a linear combination of $\psi(y_{n1}), \dots, \psi(y_{nn})$. This completes the proof of
Theorem 2.3.2.


2.3.4.2 Asymptotic minimax properties of M-, R-, and L-estimators 


For any sequence $T = \{T_n\}$ of estimators of $\beta$ in the model (2.2.1), let $D(T, F)$
denote the asymptotic variance of $(\lambda'W^{-1}\lambda)^{-1/2}\,\lambda' n^{1/2}(T_n - \beta)$ under $F$;
$\lambda = (\lambda_1, \dots, \lambda_p)'$ is any vector with $\sum_{j=1}^p \lambda_j^2 > 0$.

Considering the model (2.2.1), we may distinguish two situations:

(i) $F$ is known and smooth. Then we may determine an asymptotically efficient
estimator (e.g., the maximum likelihood estimator and the optimal
R- and L-estimators, respectively).
(ii) $F$ is only approximately known, e.g. it is known to belong to a convex compact
neighbourhood $\mathcal{F}$ of a given distribution $G$.


Let $F_0$ be the distribution in $\mathcal{F}$ which has the smallest Fisher information,
$I(F_0) = \inf_{F \in \mathcal{F}} I(F)$. Then, for any sequence $T$ of estimators, $D(T, F_0)$ will be at
best equal to $1/I(F_0)$; our aim is to find a $T_0$ such that $D(T_0, F)$ does not exceed
$1/I(F_0)$ for any $F \in \mathcal{F}$, i.e. a $T_0$ which is asymptotically minimax in the
sense that the inequalities

$$D(T_0, F) \le D(T_0, F_0) \le D(T, F_0)$$

hold for any $F \in \mathcal{F}$ and any asymptotically normally distributed sequence $T$
of estimators of $\beta$.

This section presents an explicit solution of this problem for the model of
$\varepsilon$-contamination. The following theorem shows that there exists an M-estimator
which is asymptotically minimax for this model. Moreover, due to the
correspondence between the M-, L-, and R-estimators (see (2.2.17) or Section
2.6), it immediately implies that the classes of L- and R-estimators also
contain asymptotically minimax elements. Each of these minimax estimators
will be asymptotically efficient for the least favourable distribution $F_0$.
Theorem 2.3.4 (Huber, 1969). Let

$$\mathcal{F} = \{F : F = (1 - \varepsilon)G + \varepsilon H \mid H \in \mathcal{M}\} \tag{61}$$

be a system of distributions with $\varepsilon \in [0, 1)$ being a fixed number; $G$ is a fixed
symmetric absolutely continuous distribution function such that $I(G) < \infty$ and
that its density $g$ is twice continuously differentiable and $(-\log g)$ is convex; $\mathcal{M}$
is a family of symmetric substochastic measures on $\mathbb{R}^1$, i.e. for each $H \in \mathcal{M}$ we
have $H(B) \le 1$ for any $B \in \mathcal{B}^1$. Let $\mathcal{F}_1 \subset \mathcal{F}$ be a convex subset of $\mathcal{F}$ such that,
for any $F \in \mathcal{F}_1$, either of the following three conditions holds:

(i) $I(F) < \infty$;
(ii) $f \ge (1 - \varepsilon)g$, where $f = \dfrac{dF}{dx}$;
(iii) $\int f\,dx = 1$, i.e. $f$ is the density of a probability distribution.

Then there exists a unique $F_0 \in \mathcal{F}_1$ such that

$$I(F_0) = \inf_{F \in \mathcal{F}_1} I(F) \tag{62}$$

and, if $T_0$ denotes the maximum likelihood estimator corresponding to $F_0$, then

$$D(T_0, F) \le D(T_0, F_0) \le D(T, F_0) \tag{63}$$

for any $F \in \mathcal{F}$ and for any asymptotically normally distributed and asymptotically
unbiased estimator $T$ of $\beta$.


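Proof. For $F \in \mathcal{F}_1$, we may write

$$I(F) = I^*(F) \tag{64}$$

where

$$I^*(F) = \sup_{\psi \in \mathcal{C}} \Big(\int \psi'(x)\,dF(x)\Big)^2 \Big(\int \psi^2(x)\,dF(x)\Big)^{-1} \tag{65}$$

and $\mathcal{C}$ is the set of functions $\psi$ continuously differentiable on a compact support
and such that $\int \psi^2\,dF > 0$.
Indeed, using the Schwarz inequality, we get

$$I^*(F) = \sup_{\psi \in \mathcal{C}} \Big(\int \psi f'\,dx\Big)^2 \Big(\int \psi^2\,dF\Big)^{-1} \le I(F).$$

To prove the opposite inequality, suppose that $I^*(F) < \infty$. Denoting by $A : \mathcal{C} \to \mathbb{R}^1$
the linear functional defined by $A\psi = \int \psi'\,dF$, we have

$$\|A\| = \sup_{\psi \in \mathcal{C}} \frac{|A\psi|}{\|\psi\|} = (I^*(F))^{1/2},$$

where

$$\|\psi\|^2 = \int \psi^2\,dF.$$

$A$, being bounded, extends to all of $L^2(F)$ by continuity. Thus, there exists a
function $h \in L^2(F)$ such that

$$A\psi = \int \psi h\,dF.$$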
Zz 
Put $f(x) = -\int_{-\infty}^{x} h(y)\,dF(y)$, $x \in \mathbb{R}^1$. Then $f$ is the density of $F$, is absolutely
continuous, and $-\dfrac{f'}{f} = h \in L^2(F)$. Indeed, it follows from the Fubini
theorem that

$$\int \psi'\,dF = \int \psi(y) h(y)\,dF(y) = -\iint_{y < x} \psi'(x) h(y)\,dF(y)\,dx = \int \psi'(x) f(x)\,dx \quad \text{for any } \psi \in \mathcal{C}.$$
Hence, we may minimize I*(F) as well as I(F); since [*(F) is convex in F 


(see [A 3.9]), it suffices to find a local minimum. As we shall see, the criterion 
for (fF) attaining the minimum at Fy € F, has the form: 


1/2\"" 
(° at ) ( — fe) de = 0 (66) 
0 


dF : 
forall Fy S35 f= 1 1 and for an arbitrary constant c. To prove this, 
2 
notice that J(F’) attains its minimum at F% if and only if 


a | >0 for all Fy € Fy 
dt t=0 


where 


= 1 9| = Soin te — , UF Y) — I(Fo)] 
t=0 


dt t>0 


and F, = (1 —t) Fy) + tF,, O<St <1. Supposing that fo/fo is absolutely 
continuous, we may find by direct computation that 


d 

ee Hee } = —4 f ((A?)"/8) (hh — fo) dz 2 0 
t=0 

and this implies (66) due to the fact that i (7; — fo) ‘~ = 0. Now, for any pone 

ax of the set {x: fo(x) = (1 — e) g(x)}, we have f,(z) = fo(x) and (f)?)'"/f? = 

ee the symmetry and the pame of is If x belongs to the aH 


x: fo(x) > (1 — «) g(x)}, then f,(x) — f(z) can take on positive as well as 
Se values; suggests js ea : such that (fi/?)"/f? = ¢ for some 
constant ¢, i.e., fo(z) = a e~*!#! on this domain. 

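Let

$$x_0 = \sup\Big\{x \in \mathbb{R}^1 : -\frac{g'(x)}{g(x)} \le k\Big\},$$

where $k$ is determined by the condition

$$(1 - \varepsilon)\Big[\frac{2}{k}\,g(x_0) + \int_{-x_0}^{x_0} g(x)\,dx\Big] = 1. \tag{67}$$

Put

$$f_0(x) = \begin{cases} (1 - \varepsilon)\,g(x) & \text{if } 0 \le x \le x_0 \\ (1 - \varepsilon)\,g(x_0)\,e^{-k(x - x_0)} & \text{if } x_0 \le x \\ f_0(-x) & \text{if } x < 0. \end{cases} \tag{68}$$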


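We shall show that the maximum likelihood estimator $T_0$ corresponding to
$f_0$ is asymptotically minimax for $\mathcal{F}$: put

$$\psi_0(x) = -f_0'(x)/f_0(x), \qquad x \in \mathbb{R}^1;$$

then

$$\psi_0(x) = \begin{cases} k \operatorname{sign} x & \text{if } |x| > x_0 \\ -g'(x)/g(x) & \text{if } |x| \le x_0. \end{cases} \tag{69}$$

Then $\psi_0' \ge 0$ (because $(-\log g)$ is convex) and $D(T_0, F)$ is given by Theorem
2.3.2; namely,

$$D(T_0, F) = \Big[(1 - \varepsilon)\int \psi_0'\,dG + \varepsilon\int \psi_0'\,dH\Big]^{-2} \cdot \Big[(1 - \varepsilon)\int \psi_0^2\,dG + \varepsilon\int \psi_0^2\,dH\Big]$$
$$\le \Big[(1 - \varepsilon)\int \psi_0'\,dG\Big]^{-2} \cdot \Big[(1 - \varepsilon)\int \psi_0^2\,dG + \varepsilon k^2\Big] = D(T_0, F_0).$$

On the other hand, the inequality $D(T_0, F_0) \le D(T, F_0)$ follows from the
Rao-Cramér inequality for any asymptotically normally distributed and
asymptotically unbiased estimator $T$ (i.e., $D(T, F_0) \ge 1/I(F_0) = D(T_0, F_0)$
under some regularity conditions, which are fulfilled in our case).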


Remark 2.3.4 Let $F_0$ and $\psi_0$ be given by (68) and (69), respectively; put

$$\varphi_0(t) = \psi_0(F_0^{-1}(t))$$

and

$$J_0(t) = \psi_0'(F_0^{-1}(t)) \Big(\int_{\mathbb{R}^1} \psi_0'(x)\,dF_0(x)\Big)^{-1}, \qquad 0 < t < 1.$$

Then the R- and L-estimators corresponding to $\varphi_0$ and $J_0$, respectively, are
asymptotically equivalent to $T_0$ and hence also have the minimax property
(63).

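Example 2.3.1 $\varepsilon$-contaminated normal distribution:


Put $G = \Phi$ in (61). The asymptotically minimax estimator is then either the
M-estimator generated by $\psi$ defined in (4) with $k$ of (67), or the R-estimator
with the score-generating function

$$\varphi_0(t) = \begin{cases} k & \text{if } 1 > t \ge (1 - \varepsilon)\Phi(k) + \dfrac{\varepsilon}{2} \\ \Phi^{-1}\Big(\dfrac{t - \varepsilon/2}{1 - \varepsilon}\Big) & \text{if } \dfrac{1}{2} \le t < (1 - \varepsilon)\Phi(k) + \dfrac{\varepsilon}{2} \\ -\varphi_0(1 - t) & \text{if } 0 < t < \dfrac{1}{2} \end{cases}$$

or the L-estimator corresponding to the weight function

$$J_0(t) = \begin{cases} 0 & \text{if } 1 > t \ge (1 - \varepsilon)\Phi(k) + \dfrac{\varepsilon}{2} \\ \Big[2(1 - \varepsilon)\Big(\Phi(k) - \dfrac{1}{2}\Big)\Big]^{-1} & \text{if } \dfrac{1}{2} \le t < (1 - \varepsilon)\Phi(k) + \dfrac{\varepsilon}{2} \\ J_0(1 - t) & \text{if } 0 < t < \dfrac{1}{2}. \end{cases}$$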
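For the normal case the constant $k$ can be computed numerically. The sketch below assumes that, with $G = \Phi$ and $x_0 = k$, the normalization of the least favourable density reduces to the standard condition $2\varphi(k)/k - 2\Phi(-k) = \varepsilon/(1 - \varepsilon)$, where $\varphi$ is the standard normal density; this is the usual form of Huber's condition and is taken here as an assumption about what (67) states. The bisection bounds and function names are illustrative.

```python
import math

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def huber_k(eps):
    """Solve 2*phi(k)/k - 2*Phi(-k) = eps/(1 - eps) for k, with 0 < eps < 1."""
    target = eps / (1.0 - eps)
    lo, hi = 1e-8, 40.0          # the left-hand side decreases from +inf to 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 2.0 * phi(mid) / mid - 2.0 * Phi(-mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def least_favourable_mass(eps):
    """Total mass of f_0: (1 - eps) * [central normal part + two exponential tails]."""
    k = huber_k(eps)
    return (1.0 - eps) * ((2.0 * Phi(k) - 1.0) + 2.0 * phi(k) / k)
```

Smaller contamination fractions give larger `k`, so the estimator moves towards least squares as `eps` tends to zero; `least_favourable_mass` returning 1 confirms that the resulting $f_0$ is a probability density.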


2.4 Some properties of rank tests 


The R-estimators are derived from the rank tests, so that their properties will 
follow from the properties of the tests on which they are based. Thus, we shall 
deal first with rank tests. 

The main feature of rank-based methods, and the reason for their great popularity,
is the weak set of assumptions required for their validity. The null
distribution of the rank-test statistics and the significance level of the tests
are independent of the basic distribution $F$ and thus are exactly known. It
is for this reason that the rank tests are frequently called distribution-free
or nonparametric, i.e. free of the assumption that $F$ belongs to some specified
parametric family of distributions.

First, we shall define the basic entities which will appear throughout the
present section. Let $y^{(i)}$ denote the $i$th smallest coordinate in the vector
$y = (y_1, \dots, y_n)'$, so that

$$y^{(1)} \le y^{(2)} \le \dots \le y^{(n)}. \tag{1}$$

If $(y_1, \dots, y_n)'$ is a random vector then the statistic $y^{(i)}$ is called the $i$th order
statistic, and the vector $(y^{(1)}, \dots, y^{(n)})$ of order statistics is denoted $y^{(\cdot)}$.

Let $y = (y_1, \dots, y_n)$ be a vector such that no two coordinates coincide;
denote by $r_i(y)$ the number of $y$'s which are $\le y_i$, i.e. the rank of $y_i$ in the
sequence (1):

$$y_i = y^{(r_i(y))}, \qquad i = 1, \dots, n. \tag{2}$$

The statistic $R_i = r_i(y)$ is called the rank of $y_i$; let $R = (R_1, \dots, R_n)$. We may
alternatively write

$$R_i = \sum_{j=1}^n u(y_i - y_j), \tag{3}$$

where $u(y) = 1$ if $y \ge 0$ and $u(y) = 0$ otherwise.

The ranks are defined unambiguously only if there are no ties among the
observations, i.e. if no two observations coincide; tied observations require
special treatment. The probability of coincidence of any pair of coordinates
equals 0 if the distribution function of $y$ is continuous. Let $\mathcal{R}$ denote the space
of all permutations $r = (r_1, \ldots, r_n)$ of $(1, \ldots, n)$; obviously $\mathcal{R}$ contains $n!$
points.
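
Formula (3) translates directly into code. The sketch below is only an illustration (the function names are ours); it computes the ranks by counting, refuses tied data, and lets one check relation (2) between ranks and order statistics.

```python
def ranks(y):
    """Ranks R_i = sum_j u(y_i - y_j), with u(t) = 1 if t >= 0 and 0 otherwise."""
    if len(set(y)) != len(y):
        raise ValueError("ties present: the ranks are not uniquely defined")
    return [sum(1 for yj in y if yi - yj >= 0) for yi in y]


def order_statistics(y):
    """The vector (y^(1), ..., y^(n)) of order statistics."""
    return sorted(y)


y = [2.7, -0.4, 1.1, 3.9]
R = ranks(y)                      # [3, 1, 2, 4]
y_ordered = order_statistics(y)   # [-0.4, 1.1, 2.7, 3.9]
# Relation (2): each observation is the order statistic indexed by its rank.
assert all(y_ordered[r - 1] == yi for yi, r in zip(y, R))
```

The counting form costs $O(n^2)$ operations; sorting once gives the same ranks in $O(n \log n)$, but the version above mirrors (3) literally.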

We say that a random vector $y = (y_1, \ldots, y_n)$ satisfies the hypothesis $H_0$
of randomness if it is distributed according to the density

$$p(y_1, \ldots, y_n) = \prod_{i=1}^{n} f(y_i) \tag{4}$$

where $f(x)$ is an arbitrary one-dimensional density; i.e. if the components $y_i$
are independent and identically distributed according to some density $f$.

For instance, if we consider the regression model, the hypothesis $H_0$ means
that the regression part vanishes.

We say that the random vector $y = (y_1, \ldots, y_n)$ satisfies the hypothesis $H_1$
of symmetry if it is distributed according to the density (4) with $f(x)$ being
any one-dimensional symmetric density ($f(x) = f(-x)$, $x \in R^1$). In other words,
the hypothesis $H_1$ is true if the components $y_i$ are independent and identically
distributed according to a symmetric density. Obviously, $H_1$ implies $H_0$.

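Let us consider the statistics:

$\operatorname{sign} y = (\operatorname{sign} y_1, \ldots, \operatorname{sign} y_n)$ (the sign statistics)
$|y| = (|y_1|, \ldots, |y_n|)$ (absolute values of observations)
$|y|^{(\cdot)} = (|y|^{(1)}, \ldots, |y|^{(n)})$ (the order statistics for absolute values)
$R^+ = (R_1^+, \ldots, R_n^+)$ (the ranks of absolute values)

where $R_i^+ = \sum_{j=1}^{n} u(|y_i| - |y_j|)$, $i = 1, \ldots, n$.

Let $\mathcal{V} = \{v \in R^n \mid v_i = 1 \text{ or } v_i = -1,\ i = 1, \ldots, n\}$.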

Throughout this section, we shall consider the rank tests of $H_0$ and $H_1$
against the alternatives of regression in location. We shall formulate the rank
tests which maximize the local power against these alternatives and prove
some asymptotic properties of the tests as $n \to \infty$; this will be a starting point
for investigating the asymptotic properties of R-estimates.

2.4.1 Locally most powerful rank tests


The vector $R$ of ranks of $(y_1, \ldots, y_n)$ is the maximal invariant for the problem
of testing $H_0$ against some rich sets of alternatives, under the group $\mathcal{G}$ of
transformations $\mathcal{G} = \{g': R^n \to R^n,\ g'(x) = (g(x_1), \ldots, g(x_n))\}$, where $g$ runs through
the set of all continuous strictly increasing functions $R^1 \to R^1$. For instance,
this is the case for alternatives under which the vector of observations
decomposes into two random samples with different distributions. Unfortunately,
there is no uniformly most powerful (UMP) rank test for this problem,
and thus neither is there any UMP invariant test. Thus, we shall restrict the
set of alternatives and look for a test which is most powerful locally against
a subset of alternatives near to the hypothesis among all rank tests. The
situation is analogous for $H_1$, with the only difference that the maximal
invariant is here the vector $(\operatorname{sign} X, R^+)$ and the corresponding group of
transformations is $\mathcal{G} = \{g': R^n \to R^n,\ g'(x) = (g(x_1), \ldots, g(x_n))\}$, where $g$ runs through the
set of all continuous, strictly increasing and odd functions $R^1 \to R^1$.

We shall then look for a signed-rank test which is locally most powerful for
$H_1$ against a specified set of alternatives.

In the case of $H_0$, a test is called a rank test if its test function $\xi$ is a function
of $R$ only, $\xi = \xi(R)$. The critical region of a nonrandomized rank test is a union
of some of the following events:

$$A_r = \{y : R = r\}, \quad r \in \mathcal{R}. \tag{5}$$

Under $H_0$, we have $P(R = r) = P(A_r) = 1/n!$, $r \in \mathcal{R}$. Let us consider a simple
alternative stating that $y$ has a probability distribution $Q$ and denote $Q(R = r)
= Q(\{y: R = r\})$. The Neyman-Pearson Lemma (see [A 3.3] of Bunke and
Bunke, 1986) then immediately gives the most powerful rank test of $H_0$ against
a simple alternative $Q$.


Lemma 2.4.1 The most powerful test of $H_0$ against a simple alternative $Q$ is
given by

$$\xi(r) = \begin{cases} 1 & \text{if } Q(R = r) > k \\ 0 & \text{if } Q(R = r) < k \end{cases} \tag{6}$$

where $k$ satisfies $E\xi(R) = \alpha$ under $H_0$.


In practice, however, the exact evaluation of $\xi(r)$ is rarely possible because
$Q(R = r)$ is difficult to compute. Then we may try to find the locally most
powerful rank tests.
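
Because $P(R = r) = 1/n!$ under $H_0$, the null distribution of any rank statistic can be tabulated exactly for small $n$ by enumerating $\mathcal{R}$. The sketch below does this for a statistic of the form $\sum c_i a(R_i)$; the function names and the small two-sample example are our own illustration, not part of the original text.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def null_distribution(c, a):
    """Exact H0-distribution of sum_i c[i] * a[R_i - 1]; every rank vector has mass 1/n!."""
    n = len(c)
    counts = Counter()
    for r in permutations(range(1, n + 1)):
        counts[sum(ci * a[ri - 1] for ci, ri in zip(c, r))] += 1
    total = sum(counts.values())            # equals n!
    return {s: Fraction(m, total) for s, m in sorted(counts.items())}


def upper_tail(dist, k):
    """Exact level P(S >= k) of the critical region {S >= k} under H0."""
    return sum(p for s, p in dist.items() if s >= k)


# Two-sample illustration: c marks the second sample, a(i) = i gives the rank sum.
dist = null_distribution(c=[0, 0, 1, 1], a=[1, 2, 3, 4])
level = upper_tail(dist, 7)                 # Fraction(1, 6)
```

The attainable levels form a discrete set, which is why an exact level $\alpha$ generally requires randomization on the boundary $\{Q(R = r) = k\}$ in (6).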


Definition 2.4.1 Consider an indexed set of $n$-dimensional densities $\{q_\Delta, \Delta \ge 0\}$
and assume that the random vector $X$ with density $q_0$ satisfies the hypothesis $H$.
A test is called a locally most powerful (LMP) $\alpha$-test for $H$ against $\Delta > 0$, if
there exists an $\varepsilon > 0$ such that the test is uniformly most powerful at level $\alpha$ for
$\Delta = 0$ against $K_\varepsilon = \{q_\Delta : 0 < \Delta < \varepsilon\}$.

If $\Delta$ also may be negative, we shall call a test locally most powerful for $\Delta = 0$
against $\Delta \ne 0$ if it is uniformly most powerful for $H$ against $K_\varepsilon = \{q_\Delta : 0 < |\Delta|
< \varepsilon\}$ for some $\varepsilon > 0$.


The uniformly most powerful test is also locally most powerful. On the other
hand, even if the locally most powerful test is not uniformly most powerful,
its power function increases as rapidly as possible for an $\alpha$-test in a
neighbourhood of the hypothesis.


Theorem 2.4.1 Consider a system of regression alternatives $y_i = \Delta c_i + e_i$,
$i = 1, \ldots, n$, such that the density of $(y_1, \ldots, y_n)$ is an element of

$$\Big\{ q_\Delta = \prod_{i=1}^{n} f(y_i - \Delta c_i) : \Delta \ge 0 \Big\} \tag{7}$$

where $f$ is a known density which is absolutely continuous and such that

$$\int_{R^1} |f'(x)|\, dx < \infty \tag{8}$$

and $c_1, \ldots, c_n$ are known constants. Then the test with critical region

$$\sum_{i=1}^{n} c_i a_n(R_i, f) \ge k \tag{9}$$

is the LMPR $\alpha$-test (locally most powerful rank $\alpha$-test) for $H_0$ against (7), where

$$\alpha = P\Big\{ \sum_{i=1}^{n} c_i a_n(R_i, f) \ge k \Big\}, \tag{10}$$

where $P$ is any probability distribution satisfying $H_0$ and the scores $a_n(i, f)$
correspond to $f$ in the following way:

$$a_n(i, f) = E\varphi(U^{(i)}, f), \quad i = 1, \ldots, n; \tag{11}$$

where

$$\varphi(t, f) = -\frac{f'(F^{-1}(t))}{f(F^{-1}(t))}, \quad 0 < t < 1, \tag{12}$$

and $U^{(1)} < \ldots < U^{(n)}$ is an ordered sample from the uniform distribution on
(0, 1).
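
To make (11) and (12) concrete: for the logistic density one has $\varphi(t, f) = 2t - 1$, so that $a_n(i, f) = 2i/(n+1) - 1$ (the Wilcoxon scores), while for the normal density $\varphi(t, f) = \Phi^{-1}(t)$. The sketch below evaluates $E\varphi(U^{(i)})$ by a midpoint rule against the Beta$(i, n-i+1)$ density of $U^{(i)}$; the quadrature and the function names are our own choices, made only for illustration.

```python
from math import comb
from statistics import NormalDist


def beta_density(t, i, n):
    """Density of the i-th order statistic U^(i) of n uniform (0, 1) variables."""
    return n * comb(n - 1, i - 1) * t ** (i - 1) * (1.0 - t) ** (n - i)


def scores(phi, n, nodes=20000):
    """Approximate a_n(i) = E phi(U^(i)), i = 1..n, by the midpoint rule on (0, 1)."""
    h = 1.0 / nodes
    result = []
    for i in range(1, n + 1):
        total = 0.0
        for m in range(nodes):
            t = (m + 0.5) * h               # midpoints avoid the endpoints 0 and 1
            total += phi(t) * beta_density(t, i, n)
        result.append(total * h)
    return result


wilcoxon_scores = scores(lambda t: 2.0 * t - 1.0, 5)   # logistic f
normal_scores = scores(NormalDist().inv_cdf, 5)        # normal f
```

For the logistic case the numerical values can be compared with the closed form $2i/(n+1) - 1$; for the normal case the scores are the expected values of standard normal order statistics.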


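Proof. We shall prove that for any fixed $r \in \mathcal{R}$

$$\lim_{\Delta \to 0} \frac{1}{\Delta}\, [n!\, Q_\Delta(R = r) - 1] = \sum_{i=1}^{n} c_i a_n(r_i, f) \tag{13}$$

(where $Q_\Delta$ is the probability distribution corresponding to $q_\Delta$), which implies
that there exists an $\varepsilon > 0$ such that for $0 < \Delta < \varepsilon$ and for any pair $r, r' \in \mathcal{R}$
satisfying

$$\sum_{i=1}^{n} c_i a_n(r_i, f) > \sum_{i=1}^{n} c_i a_n(r_i', f)$$

it holds that

$$Q_\Delta(R = r) > Q_\Delta(R = r').$$

Indeed, we have

$$Q_\Delta(R = r) = \int \cdots \int_{R = r} q_\Delta(y_1, \ldots, y_n)\, dy_1 \cdots dy_n,$$

so that

$$\frac{1}{\Delta} \Big[ Q_\Delta(R = r) - \frac{1}{n!} \Big] = \sum_{i=1}^{n} \int \cdots \int_{R = r} \frac{1}{\Delta} [f(y_i - \Delta c_i) - f(y_i)] \prod_{j=1}^{i-1} f(y_j - \Delta c_j) \prod_{k=i+1}^{n} f(y_k)\, dy_1 \cdots dy_n,$$

where we have used the identity $\prod_{i=1}^{n} A_i - \prod_{i=1}^{n} B_i = \sum_{i=1}^{n} (A_i - B_i) \prod_{j=1}^{i-1} A_j \prod_{k=i+1}^{n} B_k$.

As $\Delta \to 0$, the integrands of the last integral tend to $-c_i f'(y_i) \prod_{j \ne i} f(y_j)$,
$i = 1, \ldots, n$. Moreover, considering first $\Delta c_i > 0$, we get in view of (8),

$$\limsup_{\Delta \to 0} \int \cdots \int \Big| \frac{1}{\Delta} [f(y_i - \Delta c_i) - f(y_i)] \Big| \prod_{j=1}^{i-1} f(y_j - \Delta c_j) \prod_{k=i+1}^{n} f(y_k)\, dy_1 \cdots dy_n$$
$$\le \limsup_{\Delta \to 0} \frac{1}{|\Delta|} \int \int_{y_i - \Delta c_i}^{y_i} |f'(u)|\, du\, dy_i = |c_i| \int |f'(y)|\, dy, \tag{14}$$

and similarly for $\Delta c_i < 0$; this in connection with Fatou's lemma (cf. [A 2.17])
implies

$$\lim_{\Delta \to 0} \frac{1}{\Delta} \Big[ Q_\Delta(R = r) - \frac{1}{n!} \Big] = -\sum_{i=1}^{n} c_i \int \cdots \int_{R = r} f'(y_i) \prod_{j \ne i} f(y_j)\, dy_1 \cdots dy_n$$
$$= \frac{1}{n!} \sum_{i=1}^{n} c_i\, E\Big[ -\frac{f'(y^{(r_i)})}{f(y^{(r_i)})} \Big] = \frac{1}{n!} \sum_{i=1}^{n} c_i a_n(r_i, f),$$

which entails (13). This, on account of Lemma 2.4.3, completes the proof.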


Theorem 2.4.2 Consider the set of alternatives (7) where $f$ is a symmetric density
satisfying (8) and $c_1, \ldots, c_n$ are known constants. Then the test with the critical
region

$$\sum_{i=1}^{n} c_i \operatorname{sign} X_i\, a_n^+(R_i^+, f) \ge k \tag{15}$$

is the locally most powerful signed-rank test for $H_1$ against $\{q_\Delta : \Delta > 0\}$ at the level

$$\alpha = P\Big\{ \sum_{i=1}^{n} c_i \operatorname{sign} X_i\, a_n^+(R_i^+, f) \ge k \Big\} \tag{16}$$

where $P$ is any probability distribution satisfying $H_1$ and the scores $a_n^+(i, f)$
correspond to the density $f$ in the following way:

$$a_n^+(i, f) = E\varphi\Big( \frac{U^{(i)} + 1}{2}, f \Big), \quad i = 1, \ldots, n, \tag{17}$$

where $\varphi(t, f)$ and $U^{(1)} < \ldots < U^{(n)}$ are the same as in Theorem 2.4.1.
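Proof. The theorem follows from the equality

$$\lim_{\Delta \to 0} \frac{1}{\Delta}\, [2^n n!\, Q_\Delta(\operatorname{sign} y = v, R^+ = r) - 1] = \sum_{i=1}^{n} c_i v_i a_n^+(r_i, f), \quad v \in \mathcal{V},\ r \in \mathcal{R}, \tag{18}$$

which may be proved in the same way as (13).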


We shall call the statistics

$$S = \sum_{i=1}^{n} c_i a_n(R_i) \tag{19}$$

and

$$S^+ = \sum_{i=1}^{n} c_i \operatorname{sign} y_i\, a_n^+(R_i^+), \tag{20}$$

the simple linear rank statistic and the simple linear signed-rank statistic,
respectively.
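
As a small numerical illustration of (19) and (20), the sketch below evaluates both statistics with the Wilcoxon scores $a_n(i) = 2i/(n+1) - 1$ and $a_n^+(i) = i/(n+1)$, which arise from $\varphi(t) = 2t - 1$. The data, the regression constants and the function names are hypothetical; exact fractions are used so that the result can be checked by hand.

```python
from fractions import Fraction


def ranks(values):
    """R_i = number of coordinates not exceeding values[i] (no ties assumed)."""
    return [sum(1 for w in values if v - w >= 0) for v in values]


def sign(v):
    return (v > 0) - (v < 0)


def linear_rank_statistic(c, y, a):
    """S = sum_i c_i a(R_i), formula (19); a maps a rank in 1..n to a score."""
    return sum(ci * a(ri) for ci, ri in zip(c, ranks(y)))


def linear_signed_rank_statistic(c, y, a_plus):
    """S+ = sum_i c_i sign(y_i) a+(R_i^+), formula (20)."""
    r_plus = ranks([abs(v) for v in y])
    return sum(ci * sign(yi) * a_plus(ri) for ci, yi, ri in zip(c, y, r_plus))


n = 4
y = [1.5, -0.3, -2.2, 0.8]      # hypothetical observations
c = [1, 2, 3, 4]                # hypothetical regression constants


def wilcoxon(i):
    return Fraction(2 * i, n + 1) - 1


def wilcoxon_plus(i):
    return Fraction(i, n + 1)


S = linear_rank_statistic(c, y, wilcoxon)                  # Fraction(-4, 5)
S_plus = linear_signed_rank_statistic(c, y, wilcoxon_plus)  # Fraction(-3, 5)
```

Here the ranks are $(4, 2, 1, 3)$ and the ranks of the absolute values are $(3, 1, 4, 2)$, which gives $S = (3 - 2 - 9 + 4)/5$ and $S^+ = (3 - 2 - 12 + 8)/5$.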


Definition 2.4.2 Let $\beta(\alpha, H_\nu, Q_\nu)$ denote the power of the most powerful $\alpha$-test
of $H_\nu$ against the simple alternative $Q_\nu$, $\nu = 1, 2, \ldots$. Then an $\alpha$-test $\xi_\nu$ is called
asymptotically most powerful for testing $H_\nu$ against $Q_\nu$ at level $\alpha$ if

$$\lim_{\nu \to \infty} \Big[ \beta(\alpha, H_\nu, Q_\nu) - \int \xi_\nu\, dQ_\nu \Big] = 0.$$

Lemmas 2.4.1 and 2.4.2 show that the distribution of $S$ (of $S^+$) is the same
for all distributions of observations satisfying $H_0$ ($H_1$). This is not the case under
the alternatives, so that the power of the tests is not 'distribution-free'.

An advanced account of the theory of rank tests as well as a survey of selected
rank tests may be found in Hájek and Šidák (1967).

2.4.2 Asymptotic behaviour of rank and signed-rank test statistics


Two asymptotic properties of the simple linear rank and signed-rank statistics,
which will be proved in the present section, are of basic importance for the
asymptotic theory of rank tests and estimates. The remaining text of Chapter 2
will be, in fact, their consequence.

First, as the number $n$ of observations becomes large and some regularity
conditions are satisfied, the null distribution of the rank statistics tends to the
normal distribution.

Second, considering $S$ (or $S^+$) under the regression alternatives (7) as a
function of $\Delta$, we shall see that this function is asymptotically of the form
$S_0 + \Delta b_n$ as $n \to \infty$ with respect to the convergence in probability. This
property is analogous to the property of $M_n(\beta)$ described in Theorem 2.3.3.

For $n = 1, 2, \ldots$, let $y_{n1}, \ldots, y_{nn}$ be random variables and $R_{ni}$ the rank of
$y_{ni}$ in $(y_{n1}, \ldots, y_{nn})$. Consider the simple linear rank statistics

$$S_n = \sum_{i=1}^{n} x_{ni} a_n(R_{ni}), \qquad S_n^* = \sum_{i=1}^{n} x_{ni} a_n^*(R_{ni}) \tag{21}$$
under the following assumptions:

(A4) The scores $a_n(i)$, $a_n^*(i)$ are generated by a function $\varphi(t)$, $0 < t < 1$, which
is nonconstant, nondecreasing and square integrable on (0, 1), in the
following way:

$$a_n(i) = E\varphi(U^{(i)}), \quad i = 1, \ldots, n, \tag{22}$$

where $U^{(1)} < \ldots < U^{(n)}$ is an ordered sample from a uniform
distribution on (0, 1);

$$a_n^*(i) = \varphi\Big( \frac{i}{n+1} \Big), \quad i = 1, \ldots, n. \tag{23}$$

(A5) The regression constants $x_{n1}, \ldots, x_{nn}$ satisfy

$$\sum_{i=1}^{n} (x_{ni} - \bar{x}_n)^2 > 0, \qquad \bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_{ni} \tag{24}$$

and

$$\lim_{n \to \infty} \max_{1 \le i \le n} (x_{ni} - \bar{x}_n)^2 \Big[ \sum_{j=1}^{n} (x_{nj} - \bar{x}_n)^2 \Big]^{-1} = 0. \tag{25}$$

(Condition (25) (Noether's condition) guarantees the uniform asymptotic
negligibility of the summands in (19).)
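
The ratio in Noether's condition (25) is easy to inspect for a given design. The following sketch (function name ours) computes it and contrasts an equally spaced design, where the ratio equals $3(n-1)/(n(n+1))$ and tends to zero, with a design dominated by a single point, where it equals $(n-1)/n$ and the condition fails.

```python
def noether_ratio(x):
    """max_i (x_i - mean)^2 / sum_j (x_j - mean)^2, the quantity appearing in (25)."""
    n = len(x)
    mean = sum(x) / n
    squared_deviations = [(v - mean) ** 2 for v in x]
    total = sum(squared_deviations)
    if total == 0:
        raise ValueError("all regression constants coincide, so (24) is violated")
    return max(squared_deviations) / total


for n in (10, 100, 1000):
    equally_spaced = list(range(1, n + 1))      # x_i = i
    single_outlier = [0.0] * (n - 1) + [1.0]    # one dominant design point
    print(n, noether_ratio(equally_spaced), noether_ratio(single_outlier))
```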


Theorem 2.4.3 (Hájek 1961). Let $y_{n1}, \ldots, y_{nn}$ satisfy $H_0$ for $n = 1, 2, \ldots$
and the assumptions (A4) and (A5) be fulfilled. Then, as $n \to \infty$, the statistics
(19) are asymptotically normal $N(ES_n, \sigma_n^2)$ with

$$\sigma_n^2 = \sum_{i=1}^{n} (x_{ni} - \bar{x}_n)^2 \int_0^1 (\varphi(t) - \bar{\varphi})^2\, dt, \qquad \bar{\varphi} = \int_0^1 \varphi(t)\, dt. \tag{26}$$
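
For finite $n$ the first two moments of a simple linear rank statistic under $H_0$ are known in closed form: $ES_n = \bar{x}_n \sum_i a_n(i)$ and $\operatorname{Var} S_n = (n-1)^{-1} \sum_i (x_{ni} - \bar{x}_n)^2 \sum_i (a_n(i) - \bar{a}_n)^2$. The sketch below checks these two formulas against a brute-force enumeration of all $n!$ equally likely rank vectors; the design and the scores are arbitrary integers chosen by us so that the comparison is exact.

```python
from fractions import Fraction
from itertools import permutations


def exact_moments(x, a):
    """Mean and variance of S = sum_i x_i a(R_i) over all n! equally likely rank vectors."""
    n = len(x)
    values = [sum(xi * a[r - 1] for xi, r in zip(x, perm))
              for perm in permutations(range(1, n + 1))]
    mean = Fraction(sum(values), len(values))
    variance = Fraction(sum(v * v for v in values), len(values)) - mean ** 2
    return mean, variance


def formula_moments(x, a):
    """Closed-form null mean and variance of a simple linear rank statistic."""
    n = len(x)
    x_bar = Fraction(sum(x), n)
    a_bar = Fraction(sum(a), n)
    c2 = sum((xi - x_bar) ** 2 for xi in x)
    mean = x_bar * sum(a)
    variance = c2 * sum((ai - a_bar) ** 2 for ai in a) / (n - 1)
    return mean, variance


x = [0, 1, 3, 4, 7]       # illustrative regression constants
a = [1, 2, 3, 4, 5]       # illustrative integer scores
assert exact_moments(x, a) == formula_moments(x, a)
```

The theorem then says that, after centring and scaling, the discrete distribution enumerated here approaches the normal one as $n$ grows, with $\sigma_n^2$ of (26) as the limiting form of the variance.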


Proof. Let $y_{n1}, \ldots, y_{nn}$ be the random variables distributed according to (4).

(i) Consider the scores (22) at first. Denote $C_n^2 := \sum_{i=1}^{n} (x_{ni} - \bar{x}_n)^2$. Consider the
statistics

$$T_n = \sum_{i=1}^{n} (x_{ni} - \bar{x}_n)\, \varphi(F(y_{ni})) + \bar{x}_n \sum_{i=1}^{n} a_n(i), \quad n = 1, 2, \ldots \tag{27}$$

Then $S_n = E(T_n \mid \mathcal{B}_n)$, $n = 1, 2, \ldots$, where $\mathcal{B}_n = \mathcal{B}(R_{n1}, \ldots, R_{nn})$ is the
$\sigma$-field generated by $R_{n1}, \ldots, R_{nn}$; it follows from properties of conditional
expectation and from [A 3.12] that

$$E(T_n - S_n)^2 = \operatorname{Var} T_n - \operatorname{Var} S_n = C_n^2 \Big[ \int_0^1 (\varphi(t) - \bar{\varphi})^2\, dt - \frac{1}{n-1} \sum_{i=1}^{n} (a_n(i) - \bar{a}_n)^2 \Big] = o(C_n^2), \tag{28}$$

where $\bar{a}_n = \frac{1}{n} \sum_{i=1}^{n} a_n(i)$.

(28) implies that $\sigma_n^{-1}(S_n - ES_n)$ has the same asymptotic distribution as
$\sigma_n^{-1}(T_n - ET_n)$, where $ES_n = ET_n = \bar{x}_n \sum_{i=1}^{n} a_n(i)$; the latter distribution is
$N(0, 1)$, as follows from the Lindeberg-Feller Theorem (see Bunke and Bunke,
1986, [A 4.21]).

(ii) The proof is more tedious for $S_n^*$: we need the following lemma.


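Lemma 2.4.2 Consider two sequences of functions

$$\varphi_n(t) = a_n(i), \quad \frac{i-1}{n} < t \le \frac{i}{n}, \quad i = 1, \ldots, n,$$
$$\varphi_n^*(t) = a_n^*(i), \quad \frac{i-1}{n} < t \le \frac{i}{n}, \quad i = 1, \ldots, n. \tag{29}$$

Then

$$\lim_{n \to \infty} \int_0^1 (\varphi_n(t) - \varphi(t))^2\, dt = 0 \tag{30}$$

and

$$\lim_{n \to \infty} \int_0^1 (\varphi_n^*(t) - \varphi(t))^2\, dt = 0. \tag{31}$$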


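Proof. First, we shall prove that

$\varphi_n(t) \to \varphi(t)$ a.e. with respect to $\mu_{(0,1)}$,
$\varphi_n^*(t) \to \varphi(t)$ a.e. with respect to $\mu_{(0,1)}$.

The convergence is obvious for $\varphi_n^*$. Concerning $\varphi_n$, fix a $t_0 \in (0, 1)$ and take a
sequence $\{i_n\}$ such that $i_n/n \to t_0$.

Let

$$g_n(t) = \begin{cases} n \binom{n-1}{i_n - 1} t^{i_n - 1} (1-t)^{n - i_n} & \text{if } 0 < t < 1 \\ 0 & \text{otherwise} \end{cases}$$

be the density of the beta distribution $B(i_n, n - i_n + 1)$ and let $G_n(t) = \int_0^t g_n(u)\, du$.
Then $g_n$ is unimodal with the mode $(i_n - 1)/(n - 1) \to t_0$ and

$$\lim_{n \to \infty} G_n(t) = \begin{cases} 0 & \text{if } t < t_0 \\ 1 & \text{if } t > t_0. \end{cases} \tag{32}$$

Let $\delta > 0$ be such that $(t_0 - \delta, t_0 + \delta) \subset (0, 1)$. Then, regarding that $\varphi$ is
bounded on $(t_0 - \delta, t_0 + \delta)$, we have

$$|\varphi_n(t_0) - \varphi(t_0)| = \Big| \int [\varphi(t) - \varphi(t_0)]\, dG_n(t) \Big| \le \int_{t_0 - \delta}^{t_0 + \delta} |\varphi(t) - \varphi(t_0)|\, dG_n(t)$$
$$+ \Big[ \int_{|t - t_0| \ge \delta} (\varphi(t) - \varphi(t_0))^2\, dt \Big]^{1/2} \Big[ \int_{|t - t_0| \ge \delta} g_n^2(t)\, dt \Big]^{1/2},$$

where the last term tends to 0 as $n \to \infty$.

This proves (32). Moreover,

$$\int_0^1 \varphi^2(t)\, dt = E\varphi^2(U_1) \ge E\big( E(\varphi(U_1) \mid R_{n1}) \big)^2 = \int_0^1 \varphi_n^2(t)\, dt$$

and this in connection with Fatou's lemma proves (30).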
On the other hand, the functions $(\varphi_n^*(t))^2$ are uniformly integrable on (0, 1).
Indeed, there exists a $\delta > 0$ for any $\varepsilon > 0$ such that $\int_A \varphi^2(t)\, dt < \varepsilon$ for any
$A \subset (0, 1)$ such that $\mu(A) = \int_A dt < \delta$. Then
(t)P dt = di etl (a 
J len) Eo (a)e(4o( Neel A 


A 
= if y(t) 


Br n|[(n+1) 


1 
n+1 
n 


y(t) dt <« for n > [6-4] 


1 1 
where #@, = A + ag té+ —:t€ Ap. This together with (32) proves (31). 
n 


(iii) We are now able to complete the proof for $S_n^*$. We get by [A 3.12] and
by direct computation,

$$E(S_n - ES_n - S_n^* + ES_n^*)^2 = \frac{1}{n-1}\, C_n^2 \sum_{i=1}^{n} [a_n(i) - \bar{a}_n - a_n^*(i) + \bar{a}_n^*]^2$$
$$\le \frac{1}{n-1}\, C_n^2 \sum_{i=1}^{n} (a_n(i) - a_n^*(i))^2 \le 2\, C_n^2 \int_0^1 (\varphi_n(t) - \varphi_n^*(t))^2\, dt = o(C_n^2), \tag{33}$$

so that $\sigma_n^{-1}(S_n^* - ES_n^*)$ has the same asymptotic distribution as $\sigma_n^{-1}(S_n - ES_n)$.


An analogous theorem is valid for the simple linear signed-rank statistics
(see Hájek and Šidák, 1967).

The second basic property of simple linear rank statistics is their uniform
asymptotic linearity with respect to the regression parameter.

Suppose that $y_{n1}, \ldots, y_{nn}$ are distributed according to (4) and that $I(f) < \infty$.
Let $R_{ni}^\beta$ be the rank of $y_{ni} + \beta d_{ni}$ in $(y_{n1} + \beta d_{n1}, \ldots, y_{nn} + \beta d_{nn})$, $i = 1, \ldots, n$,
where $d_{n1}, \ldots, d_{nn}$ are given constants and $\beta \in R^1$. Consider the statistics

$$S_{n\beta} = \sum_{i=1}^{n} x_{ni} a_n(R_{ni}^\beta), \qquad S_{n\beta}^* = \sum_{i=1}^{n} x_{ni} a_n^*(R_{ni}^\beta) \tag{34}$$

under the assumptions (A 4), (A 5), and the two following additional assumptions:

(A 6) The constants $d_{n1}, \ldots, d_{nn}$ satisfy

$$\sum_{i=1}^{n} (d_{ni} - \bar{d}_n)^2 \le M, \qquad \bar{d}_n = \frac{1}{n} \sum_{i=1}^{n} d_{ni}, \quad n = 1, 2, \ldots \tag{35}$$

where $M > 0$ is a constant, and

$$\lim_{n \to \infty} \max_{1 \le i \le n} (d_{ni} - \bar{d}_n)^2 = 0. \tag{36}$$

for all $i, j = 1, \ldots, n$.

The assumption (A 7) of concordance-discordance is analogous to (2.3.35);
see also Remark 2.3.2.


Theorem 2.4.4 (Jurečková, 1969). Suppose that $y_{n1}, \ldots, y_{nn}$ are distributed
according to (4) where $f$ is a density with finite Fisher information. Then, under
the assumptions (A 4)-(A 7),

$$\lim_{n \to \infty} P\Big\{ (\operatorname{Var} S_{n0})^{-1/2} \max_{|\beta| \le K} |S_{n\beta} - S_{n0} - \beta b_n| \ge \varepsilon \Big\} = 0 \tag{37}$$

for any $\varepsilon > 0$, $K > 0$, where

$$b_n = \sum_{i=1}^{n} (x_{ni} - \bar{x}_n)(d_{ni} - \bar{d}_n) \int_0^1 \varphi(t)\, \varphi(t, f)\, dt \tag{38}$$

with $\varphi(t, f)$ defined in (12).

Proof. We may suppose without loss of generality that $\sum_{i=1}^{n} x_{ni} = 0$, $\sum_{i=1}^{n} x_{ni}^2 = 1$.

(i) We shall first prove (37) for the scores (22) and a fixed $\beta \in R^1$.
Consider a sequence {p“(t)},<97 of functions on (0, 1) 


eae? (39) 


a aa 
()(¢) = i — <i 
POt) =¢ (; a i 1 i = 
and put | 

ai) = Fo®(U), i=1,.:.,n; k= 1,2,... 
Introduce the statistics 

Sy =); aja (Re), faa NN We oe (40) 

+=1 


(we are omitting the subscripts n in x,;, R,;, etc.). Then 


E[S,9 —S®P<e for n>mn(k,e) and k> ko(e). (41) 
Indeed, 
‘ 1 ae 
E[Sno — So P S —— > [a,(t) — ap 
nm — | iji=1 
1 
= [ (on(t) — o(e))? at 
n—I1 
0 
where 
grlt)=a,() if <is— 
n n 
(41) then follows from Lemma 2.4.3. Let us introduce another system of
statistics


ve n i ta 1 cee The OL ee 
Tis ae 2 Tn iV (F(Yni Ei B(dni ee d,))), (42) 
i=1 Nia lnterereate 
Then (28) with g replaced by gy“ implies 
lim E(S\) — T)2 — 0, eo i ae (43) 
n—>co 
Moreover, we shall prove 
lim ee ET!) — TYP —0, i 2 es (44) 


In fact, 
Var(T\) — Te) 


= Sab- f [p(#(e + (dys — F,))) — 9(Fle))f aF (2). 


1=1 


The last integral tends to 0 by [A 2.18] as $n \to \infty$, for $F$ is continuous and
$\varphi^{(k)}$ is bounded, nondecreasing, and thus has at most countably many
discontinuities.

The sequence of densities 


{np} = {Ir fy — Bani — d,)} (45) 


neN 


is contiguous wal ee to {Pn}new (see [A 2.5] and thus it follows from (43) 


that Si) 4 7 aL (ey ya co; hence, 


so — Ti! Fe, 0 as noo for k=1,2,... (46) 
[A 2.6] together with (41) imply 

P{|Sng — S| Seh<e for n>m(k,ce) and k>k(e). (47) 
Combining (41), (43), (44), (46), and (47), we get 

P(|Snp — Sy — BT| Sh <e (48) 


for n > n2(k, e) and k > k,(e). 
According to [A 2.13], the statistics T'/) are, for k > k* and as n > ov, 
asymptotically normal N (fb), (o)?) where 


i 
Ie se f ott 
0 


t—1 
and
1 


\ i 
of*))2 = (i (p( OS p”) 2 dt, Gh) = fo) dt. 
0 ) 


Moreover, the statistics 7‘) are asymptotically normally distributed 
N(0, (o)?), so that, on acount of (44), HT") in (48) may be replaced by fb}. 
Further, the Schwarz inequality and (30) imply that 


lon? — by] = ¥ apd, — a) | to" — 91 we f) dt 


t=1 


tends to 0, as $k \to \infty$, uniformly in $n$; so that we finally get

$$S_{n\beta} - S_{n0} - \beta b_n \xrightarrow{P} 0 \quad \text{as } n \to \infty. \tag{49}$$

(ii) Consider the scores (23). The convergence (49) for $S_{n\beta}^*$ follows from (33),
from the contiguity of $\{q_{n\beta}\}$ with respect to $\{p_n\}$, and from (49).

(iii) It remains to prove that the convergence (49) holds not only for a fixed $\beta$,
but also for the maximum over $\{\beta : |\beta| \le K\}$. This part of the proof is analogous
to the corresponding part of the proof of Theorem 2.3.3 concerning $M_n(\beta)$.
We only need to prove that $S_{n\beta}$ is monotone in $\beta$ for fixed $y_1, \ldots, y_n$ with
probability 1. But this follows from [A 3.8] and from assumption (A 7).


Remark 2.4.1 Theorem 2.4.4 presents the simplest version of the uniform
asymptotic linearity of simple linear rank statistics. This property could also
be proved under less restrictive assumptions, for the $p$-parameter case, etc.
The uniform asymptotic linearity for signed-rank statistics has been proved
by van Eeden (1972).


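Theorem 2.4.5 (van Eeden, 1972) For each $n \in N$, let $y_{n1}, \ldots, y_{nn}$ be
independent and identically distributed random variables with common distribution
function $F$ satisfying the following conditions:

$F$ has an absolutely continuous density $f$,

$$\int_0^1 \varphi^2(t, f)\, dt < \infty,$$

$$f(-t) = f(t), \quad t \in R^1.$$

Let $\psi(t)$, $0 < t < 1$, be a function such that:

$\psi(t)$ can be written as the sum of two functions $\psi_1(t)$ and $\psi_2(t)$ where $\psi_1(t)$ is
nondecreasing and nonnegative and $\psi_2(t)$ is nonincreasing and nonpositive,

$$\int_0^1 \psi_i^2(t)\, dt < \infty \ (i = 1, 2) \quad \text{and} \quad \int_0^1 \psi^2(t)\, dt > 0.$$

Let $c_{n1}, \ldots, c_{nn}$ and $d_{n1}, \ldots, d_{nn}$ be vectors of constants such that

$$\sum_{i=1}^{n} c_{ni}^2 > 0,$$

$$\lim_{n \to \infty} \Big[ \max_{1 \le i \le n} c_{ni}^2 \Big( \sum_{j=1}^{n} c_{nj}^2 \Big)^{-1} \Big] = 0,$$

$$\sum_{i=1}^{n} d_{ni}^2 \le M \quad \text{for some } M > 0 \text{ independent of } n,$$

$$\lim_{n \to \infty} \max_{1 \le i \le n} d_{ni}^2 = 0$$

and, for each $n = 1, 2, \ldots$, either

$$c_{ni} d_{ni} \ge 0, \qquad (|c_{ni}| - |c_{ni'}|)(|d_{ni}| - |d_{ni'}|) \ge 0 \quad \text{for all } i, i' = 1, \ldots, n,$$

or

$$c_{ni} d_{ni} \le 0, \qquad (|c_{ni}| - |c_{ni'}|)(|d_{ni}| - |d_{ni'}|) \ge 0 \quad \text{for all } i, i' = 1, \ldots, n.$$

Let $R_{ni\beta}^+$ be the rank of $|y_{ni} - \beta d_{ni}|$ among $|y_{n1} - \beta d_{n1}|, \ldots, |y_{nn} - \beta d_{nn}|$, and let

$$S_{n\beta}^+ = \sum_{i=1}^{n} c_{ni}\, \psi\Big( \frac{R_{ni\beta}^+}{n+1} \Big) \operatorname{sign}(y_{ni} - \beta d_{ni}).$$

Then, for every $\varepsilon > 0$ and $C > 0$,

$$\lim_{n \to \infty} P\Big\{ \max_{|\beta| \le C} \Big| S_{n\beta}^+ - S_{n0}^+ + \beta K \sum_{i=1}^{n} c_{ni} d_{ni} \Big| \ge \varepsilon\, (\operatorname{Var} S_{n0}^+)^{1/2} \Big\} = 0$$

where $K = \int_0^1 \psi(t)\, \varphi\big( \tfrac{t+1}{2}, f \big)\, dt$.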

Remark 2.4.2 The uniform asymptotic linearity of rank, signed-rank, and
other statistics provides a basic tool for proving the asymptotic properties of
tests and estimates based on these statistics. For instance, it enables an
asymptotic treatment of nuisance parameters in hypothesis testing (see
Hájek, 1969; Jurečková, 1971b); another application consists in deriving the
asymptotic distribution of rank, signed-rank, and other estimates and in
treating the asymptotic relations between them. We shall utilize the uniform
asymptotic linearity of rank statistics several times in the subsequent text.
See also Jurečková (1973b) for some more details.


2.5 Estimators of regression coefficients based on rank tests


Let us consider the regression model (2.2.1). We shall study the properties,
mainly the asymptotic ones, of the estimator of $\beta$ based on rank tests of $H_0$
against regression alternatives. Such are, for instance, the estimators of the
type (2.2.13) and (2.2.11).

Let $R_i^\beta$ denote the rank of the $i$th residual and put

$$S_{nj}(y - X\beta) = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)\, a_n(R_i^\beta), \quad j = 1, \ldots, p, \tag{1}$$

where

$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}.$$

Then $S_{nj}(y)$ is a simple linear rank statistic defined in Section 2.4.1. If $a_n(i)
= a_n(i, f)$, $i = 1, \ldots, n$ (see equation (2.4.11)), then $S_{nj}(y)$ provides the locally
most powerful test of $H_0$ (i.e., $\beta = 0$) against the alternatives that $y_{n1}, \ldots, y_{nn}$
are distributed according to $\prod_{i=1}^{n} f\big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \big)$, in a neighbourhood of 0 (see
Theorem 2.4.1). The minimization (2.2.13) can be rewritten as

$$\sum_{j=1}^{p} |S_{nj}(y - X\beta)| = \min! \tag{2}$$

The statistics $S_{nj}(y - X\beta)$ are step-functions of $\beta$ and their definition could
be completed by continuity so that they are well defined for all $\beta$ unless some
components of $y$ are tied. The solution of (2) is not uniquely determined;
denote by $\mathcal{B}_n$ the set of solutions of (2). We shall show that $n^{1/2}(\beta_n - \beta)$ is
asymptotically equivalent to $n^{-1/2} \gamma^{-1} W_n^{-1} S_n(y)$ for any $\beta_n \in \mathcal{B}_n$, as $n \to \infty$, where
$S_n(y) = (S_{n1}(y), \ldots, S_{np}(y))'$.

Jaeckel (1972) suggested any solution of the minimization (2.2.11) as an
estimator of $\beta$ and proved that the solutions of (2.2.11) and (2.2.13) are
asymptotically equivalent in the sense that their difference tends to zero in probability
as $n \to \infty$. (2.2.11) involves the minimization of a convex function of $\beta$; the
function and its derivatives can be calculated everywhere; in fact, the
derivatives are $-S_{nj}(y - X\beta)$, $j = 1, \ldots, p$.

Koul (1971) considered the confidence region

$$\{\beta : S_n'(y - X\beta)\, W_n^{-1} S_n(y - X\beta) \le k_\alpha\} \tag{3}$$

where $k_\alpha$ is the critical value of the $\chi^2$ distribution, and suggested the centre of
gravity of (3) as an estimator of $\beta$. Again, this estimator is asymptotically
equivalent to the solution of (2).

We shall find the asymptotic distribution of the estimator given by (1) (and
at the same time that of any of the two other estimators). We shall show that
the procedures yield asymptotically efficient estimators by an appropriate choice
of the scores $a_n(i)$. Any of the three classes of estimators contains an
asymptotic minimax estimator, defined in Section 2.3.

The explicit form of the estimator is known only in simple special cases.
For instance, in the case of shift in location ($p = 1$; $x_i = 0$, $i = 1, \ldots, m$;
$x_i = 1$, $i = m+1, \ldots, n$) the R-estimator $\hat{\beta}$ of $\beta$ is the median of the $m(n - m)$
differences $(y_{m+j} - y_i)$, $i = 1, \ldots, m$; $j = 1, \ldots, n - m$ (Hodges and Lehmann,
1963).
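
The shift estimator just described is simple enough to state in a few lines. The following sketch (function name and data are ours) forms all $m(n-m)$ differences between the second and the first sample and returns their median.

```python
from statistics import median


def hodges_lehmann_shift(first, second):
    """Median of all differences second[j] - first[i], the R-estimator of a location shift."""
    return median(b - a for a in first for b in second)


first = [1.0, 2.0, 4.0]       # observations with x_i = 0
second = [3.0, 5.0, 9.0]      # observations with x_i = 1
shift = hodges_lehmann_shift(first, second)   # 3.0
```

The nine differences here are $-1, 1, 1, 2, 3, 4, 5, 7, 8$, whose median is 3. Adding a constant to every observation of the second sample changes the estimate by exactly that constant.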

In general, appropriate computational algorithms have not yet been
elaborated. Any such algorithm should be iterative and every step requires a new
ordering. From this point of view, the linearized versions of R-estimators are
more convenient, for their calculation needs only one ordering.

A linearized version of rank estimators was studied by Kraft and van Eeden
(1972b); it will be described in Section 2.5.2. It consists of a consistent initial
estimator and an additive term based on ranks. The linearized rank estimator
could also yield an asymptotically efficient estimator by an appropriate choice of
the scores. Nevertheless, it is not asymptotically equivalent to the R-estimator.

The optimal choice of an estimator within one of the classes mentioned above
depends on the basic distribution $F$. If this is unknown then one might use a
part of the observations to estimate $F$, and then adapt the estimator of $\beta$ to
this estimated $F$ and in this manner obtain an asymptotically efficient estimator
of $\beta$. This idea was first suggested by Stein (1956). A more detailed description
of the adaptive estimator can be found in Section 2.5.3. Despite the excellent
asymptotic properties of the estimators they are not convenient for practical
purposes unless the number of observations is extremely large.

We shall deal only with adaptive estimators based on ranks; but regarding
the close relationships between different types of robust estimators we could
imagine that the considerations could be modified in order to obtain adaptive
M- and L-estimators.


2.5.1 Asymptotic normality of R-estimators

Consider the model (2.2.1) under the following system of assumptions:

(A 8) The error distribution $F$ satisfies the assumption (A 1) of Section 2.3.4.2.

(A 9) Let $X_n = ((x_{ij}))_{i=1,\ldots,n}^{j=1,\ldots,p}$ be a known $(n \times p)$ matrix satisfying

(a) $x_{ij} = x_{ij}^* + x_{ij}^{**}$, $i = 1, \ldots, n$; $j = 1, \ldots, p$.

(b) The vectors $x_j^* = (x_{1j}^*, \ldots, x_{nj}^*)'$, $j = 1, \ldots, p$, satisfy

$$n^{-1} (x_j^* - \bar{x}_j^*)' (x_j^* - \bar{x}_j^*) \le M, \qquad \bar{x}_j^* = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^*, \quad j = 1, \ldots, p, \tag{4}$$

where the scalar product in (4) is either zero for all but a finite number
of $n$ or positive for all but a finite number of $n$; if it is positive, then

$$\lim_{n \to \infty} \Big\{ \max_{1 \le i \le n} (x_{ij}^* - \bar{x}_j^*)^2 \Big[ \sum_{i=1}^{n} (x_{ij}^* - \bar{x}_j^*)^2 \Big]^{-1} \Big\} = 0. \tag{5}$$

$M > 0$ is a constant. Analogous conditions (including Noether's condition)
are to be satisfied for the vectors $x_j^{**}$, $j = 1, \ldots, p$.

(c) The inequalities (2.3.35) hold for all pairs $j, h = 1, \ldots, p$ and $i, k
= 1, \ldots, n$.

(d) $\lim_{n \to \infty} W_n = \Sigma$ exists and is positive definite, where
$W_n = ((w_{jk}^{(n)}))_{j,k=1,\ldots,p}$ and

$$w_{jk}^{(n)} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k), \quad j, k = 1, \ldots, p. \tag{6}$$


Remark 2.5.1 Remark 2.3.3 applies to the assumption (A 9) as well.


(A 10) Let $S_{nj}(y - X\beta)$, $j = 1, \ldots, p$, be the statistics of (1) with the scores
$a_n(i)$, $i = 1, \ldots, n$, generated by a function $\varphi(t)$, $0 < t < 1$, satisfying
the assumption (A 4) of Section 2.4.2, either by (2.4.28) or by (2.4.29).

Denote

$$\gamma := \int_0^1 \varphi(t)\, \varphi(t, f)\, dt \tag{7A}$$

with $\varphi(t, f)$ defined in (2.4.12) and

$$\sigma^2 := \int_0^1 (\varphi(t) - \bar{\varphi})^2\, dt, \qquad \bar{\varphi} = \int_0^1 \varphi(t)\, dt. \tag{7B}$$

Let $\mathcal{B}_n$ be the set of solutions of the minimization (2). We shall accept any
point of $\mathcal{B}_n$ as an estimator of $\beta$.


Definition 2.5.1 We say that $n^{1/2}(\beta_n - \beta)$ is asymptotically normal $N_p(a, A)$
pointwise over the set $\mathcal{B}_n$, if there exists a sequence of random vectors $\{T_n\}_{n \in N}$ such
that $n^{1/2}(T_n - \beta)$ is asymptotically normal $N_p(a, A)$ and

$$\sup_{\beta_n \in \mathcal{B}_n} \| n^{1/2}(\beta_n - T_n) \| \xrightarrow{P} 0 \quad \text{as } n \to \infty. \tag{8}$$

The main theorem of the section follows.

Theorem 2.5.1 Under the assumptions (A 8)-(A 10), $n^{1/2}(\beta_n - \beta)$ is
asymptotically normal

$$N_p\Big( 0, \frac{\sigma^2}{\gamma^2}\, \Sigma^{-1} \Big) \text{ pointwise over } \mathcal{B}_n. \tag{9}$$

Proof. The proof follows ideas similar to those of the proof of Theorem 2.3.2.
Our main tool is the uniform asymptotic linearity of rank statistics. We shall
first extend this property to the multiparameter case.


Lemma 2.5.1 Under the assumptions (A 8), (A 9), (A 10),

$$\lim_{n \to \infty} \max_{1 \le j \le p} P_{\beta^0} \Big\{ \max_{n^{1/2} \|\beta - \beta^0\| \le K} n^{-1/2} \big| S_{nj}(y - X\beta) - S_{nj}(y - X\beta^0) + n \gamma\, \sigma_{\cdot j}'(\beta - \beta^0) \big| \ge \varepsilon \Big\} = 0 \tag{10}$$

for any $K > 0$, $\varepsilon > 0$ and any fixed $\beta^0 \in R^p$; $\sigma_{\cdot j}$ is the $j$th column of $\Sigma$.


Proof of Lemma 2.5.1 We may suppose without loss of generality that ¢ is 
nondecreasing and that bes = 0. For each h, 1S hsp, let A, = {4 € R? | 
Ay = Ofork +h, k =1,..., p}. Then it follows from Theorem 2.4.4 that 


lim max Po{n-V?|S,(y — XB) — S,;(y) + nyByo,;| = e} = 0 


n—>co 15jsp ( 


for any « >0 and for any sequence { em Laem Such that n/2g™ — A € A,, 
Neem ae SP Ee 


p 

Let &; (B*, B**) be the rank of y,; — Xe xe + xi*BF), 4% Atealts 
where {B*} = {(B*}, 7 and {f**} = pawn) | ew are two sequences of vectors 
from R? such that n4/2g* = AM, n/2p** — A®,n = 1, 2,..., ||AM|, ||| S K. 


Introduce the statistics 


S(B*, BY) = ¥ (oh — Ft) a,(R(6*, 6**)) 
dence (12) 
St,(B**, B**) -— Y (att — zt) a,(Ri(B*, B**)) 
and 
S,(B*, B**) = S8t(B*, B**) + SHB BM), GF —1-p- (13) 


The sequences of densities {9%}, {q7,*}, where 
n Pp 


and 


- - we TR I 
dn (y) = at a — 65°) Bj ) 
are contiguous with respect to $p_n(y) = \prod_{i=1}^{n} f(y_i)$, $n = 1, 2, \ldots$ (see [A 2.5]).


This in connection with (10) implies trae 


lim P, ‘ae S,,;(B*, B**) — Sr;(y) 


n—>co 


+ S60 ie — Ue)! (w%, — BE) + Ber (ait — Bry’ (whet — 24") 


ze =o, 


(14) 


On the other hand, it follows from (2.3.5) and from Jurecékova (1969, theorem 


2.1) that S%; is nonincreasing in ff, ..., 65 and nondecreasing in fy*, ..., Bp", 
while S7F is sions stiss in ff, ..., 8, and nonincreasing in fy", ..., 65*. 


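The rest of the proof is quite analogous to the corresponding part of the
proof of Theorem 2.3.3 (see (2.3.51)-(2.3.53)).


Consider the sequence $\{v_n\}$ of random vectors

$$v_n = n^{-1/2}\, \gamma^{-1}\, \Sigma^{-1} S_n(y - X\beta^0) \tag{15}$$

where $S_n(y - X\beta^0) = (S_{n1}(y - X\beta^0), \ldots, S_{np}(y - X\beta^0))'$ and $\beta^0$ is the true
(unknown) parameter value. Then Theorem 2.4.3 implies that $v_n$ is
asymptotically normal $N_p\big( 0, \frac{\sigma^2}{\gamma^2} \Sigma^{-1} \big)$. The proof of Theorem 2.5.1 will be complete if
we prove the following lemma.

Lemma 2.5.2 Under the assumptions (A 8)-(A 10),

$$P_{\beta^0} \Big\{ \sup_{\beta_n \in \mathcal{B}_n} \| n^{1/2}(\beta_n - \beta^0) - v_n \| \ge \varepsilon \Big\} \to 0 \quad \text{as } n \to \infty \text{ for any } \varepsilon > 0. \tag{16}$$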
Proof. We may put 6° = 0. Denote 

TQ = (6 € RP | ni! |p| < Kp. 
Lemma 2.5.1 and the continuity of the operator 2-1! imply that 


lim Py) sup |n¥?98, — V,l| =e, Fn IY +0 
BnEBrnI , 


n—>-co 


n—oo 


= lint Ps | sup ne 
mB all SK 


‘ oa nl — XBn) — Sry) + 2B n z= 0 
(17) 


for any e > 0, K > 0. Moreover, there exist K* = K*(e) > 0 and m = nj(e) 
such that 


Le Semen ess hy Se ae) 
To prove (18), take M > 0 such that (—M) < ra where @ is the stand- 
p 
ard normal distribution function and K* and 6 satisfy 


2\1/2 
se ecclesia 


Myp'?x (19) 
Ao 


ro] 


where Ay is the minimal eigenvalue of 2. Then there exists a positive integer 
m, such that 


Pot min (—f’-S,(y — XB)) < onl La ee (20) 
n¥/2\||< K* 2 


hold for all n > n, with 6* = 6K*. 
Actually, the left-hand side of (20) is bounded from above by the sum 


Py| min (—6’- S,(y — XB)) <6, 
n*!?||B|| = K* 


fain (—B') [Saty) — ny38] = 20+] 


ni!?||p|| = K* 


+Pe min (=P) [S,(9) — nyBp) < 2644, (21) 


nl?||B||= K* 


The first term of (21) is less than or equal to 


Po max (—F')[Saly) — myEB — Suly — XB] & O* 
n}/?||B\| = K* 


P O* 

<= Po} max nl?) |S,i(y) — nyo iB — Saj(y — XB)| = a ai0 
maigj=—K* — j=1 K 

(22) 


n —> co on account of Lemma 2.5.1. 


The second term of (21) is less than or equal to 


p 
Po{n-¥? |8,(y)|| > K*Aoy — 20} S x nH? |S ,(y)| = oN} 


(23) 


IIA 


€ 
— for n>% 
4 

in view of (19) and of Theorem 2.4.1. (20) then follows from (22) and (23). 


Lf 8, € IR? — 1%) then m1? ||6,|| = K* for By = eo nl2 and 


; 1 (_ pay gy —X 
2 lSuily — XB) 2 aq (PY uly — 8) 
so that, according to (20), 


Pe{ min ih ¥ suly — X <3} 


pelR?—I\) pat 
< Pst min (—£'S,(y — XB)) < ox aie (24) 
n/2||B|| = K* 2 
for n > 7. 


On the other hand, Lemma 2.5.1 implies 


P 
P> jin nV? Y S,(y — XB)| = | — for n>, (25) 
j=l 


Ae 
pel? 2 
and (18) follows from (23) and (24). 
Finally, applying (17) and (18), we get 


P, | sup |in'"B, — v,l| = ‘| 


BrEBn 


Bn€ Ban 2 


= Po sup ||n¥/28, — v,|| 22,82, 9 IM + at 


ci Po sup ||n2?6,, sce Vall = é, sup |jn4?2B, re all <6, 
Bnf Bn BrEBn—I 


By 9 TE OF + Pal By — IR 0) 0, as n—->oo. Hl 


Corollary 2.5.1 If $\varphi(t) = \varphi(t, f)$, 0 < t < 1, and if the assumptions (A 8)-(A 10) are satisfied, then $n^{1/2}(\hat\beta_n - \beta^0)$ is asymptotically normally distributed

\[ N_p\left(0, \frac{1}{I(F)}\,\Sigma^{-1}\right) \]

pointwise over $B_n$; hence, $\hat\beta_n$ provides an asymptotically efficient estimator.


Proof. The asymptotic covariance matrix of (9) reduces to

\[ \frac{\sigma^2}{\gamma^2}\,\Sigma^{-1} = \int_0^1 \varphi^2(t, f)\,dt \left(\int_0^1 \varphi^2(t, f)\,dt\right)^{-2} \Sigma^{-1} = \frac{1}{I(F)}\,\Sigma^{-1}. \]

Table 2.5.1 gives the asymptotic efficiencies e(F) of the R-estimator based on the Wilcoxon test (i.e. $\varphi(t) = 2t - 1$, 0 < t < 1) with respect to the least squares estimator for the normal, double-exponential, and logistic error distributions (the efficiencies are measured by the ratios of determinants of asymptotic covariance matrices).
e(F) never falls below 0.864 and can be infinite. Hodges and Lehmann (1963) showed that both bounds can be attained by specific distributions.
Table 2.5.1

F                     e(F)
normal                0.955
double-exponential    1.500
logistic              1.097
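
The entries of Table 2.5.1 can be checked numerically. The sketch below assumes the classical closed form of the Wilcoxon efficiency in the scalar case, e(F) = 12 σ² (∫ f²(x) dx)², which is not written out in the text above; the function names and the integration range are illustrative choices only.

```python
import math


def integrate(func, lower, upper, steps=100_000):
    """Composite midpoint rule for the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (k + 0.5) * width) for k in range(steps))


def wilcoxon_are(density, half_range=40.0):
    """12 * Var(F) * (integral of f^2)^2 for a density symmetric about 0.

    Assumption of this sketch: the mean is 0, so the variance is the
    second moment, and the tails beyond half_range are negligible.
    """
    variance = integrate(lambda x: x * x * density(x), -half_range, half_range)
    f_squared = integrate(lambda x: density(x) ** 2, -half_range, half_range)
    return 12.0 * variance * f_squared ** 2


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def double_exponential_pdf(x):
    return 0.5 * math.exp(-abs(x))


def logistic_pdf(x):
    # exp(-x) / (1 + exp(-x))^2 written in an overflow-free form
    return 0.25 / math.cosh(0.5 * x) ** 2


for name, pdf in [("normal", normal_pdf),
                  ("double-exponential", double_exponential_pdf),
                  ("logistic", logistic_pdf)]:
    print(f"{name:20s} {wilcoxon_are(pdf):.3f}")
```

Under these assumptions the three values agree with the table to three decimals (3/π, 3/2 and π²/9, respectively).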


2.5.2 Linearized rank estimators and their asymptotic distribution 


' Let us keep the assumptions (A 8), (A 9), and (A 10) of Section 2.5.1. Let p, 
be a sequence of initial estimators satisfying the assumption of equivariance: 


(A 11) B,(a(y — XB)) = a(B,(y) — B) (26) 


for all 8 € R? anda > 0, 


and there exists a function ¢,(t),0 < ¢ < 1, satisfying assumption (A 4) 
of Section 2.4.2 and such that 


ml E B ( fa op a)" 280%] (27) 
) 


tends in probability to zero under the hypothesis 8 = 0, where 
Se) (SW). sas only) 


and 


= s 


Let g be a function satisfying the assumption (A 4) of Section 2.4.2 and 
let S,,(y — XB), k = 1,..., p, be the statistics given in (1). The linearized rank 
estimator of 6 is then defined as 


6=6.+ + W,Syy — Xf) (28) 


where W,, is given by (6) and a? by (8). 


Remark 2.5.2 The least squares estimator satisfies (A 11) with the function $F^{-1}(t)$, 0 < t < 1 (see Section 2.6).


Denote 


a= (nl) Ard, A= { mlt)det 
0 0 
1 
n= f elt) ot f) dt (29) 
0 


O90) ae f(a@ am f1) (y(t) a 7) dt. 


0 
The following theorem states that the linearized rank estimators (27) have an
asymptotic normal distribution.


Theorem 2.5.2 Suppose that the assumptions (A 8)—(A 11) are satisfied. Then, 
as 1 —> oo, nil2(3 — B) have the asymptotic normal distribution N (0, xX~1) 


-9(-2)24(- 
Corollary 2.5.2 If $\varphi(t) = \varphi(t, f)$, 0 < t < 1, then $n^{1/2}(\hat\beta - \beta)$ has the asymptotic normal distribution $N_p\left(0, \frac{1}{I(F)}\Sigma^{-1}\right)$ for any initial estimator; $\hat\beta$ is then an asymptotically efficient estimator of $\beta$.

Proof of Theorem 2.5.2. We can suppose that $\beta = 0$. The following lemma is an easy consequence of Lemma 2.5.1.


Lemma 2.5.3 Suppose that the sequence { Bi} of random vectors is asymptotically 
bounded in probability. Then, under the assumptions (A 9) and (A 10), 


n-2 \IS,(y — XB) — Saly) + yrZB,|| 2+ 0, as n—->co. (30) 


Definition 2.5.2 The random vectors $u_n$ and $v_n$ are called asymptotically P-equivalent (denoted $u_n \sim v_n$) if $P\{\|u_n - v_n\| > \varepsilon\} \to 0$, as $n \to \infty$, for any $\varepsilon > 0$.


Lemma 2.5.3 and (26) imply that 


mis mw acanss| 7 (1 — 2) sire + sa). (31) 
V1 oo a? 
[A 2.14] implies that 
AAS) nAPT MY), nAPS (y) ~ 2PL(y) (32) 

where 
PG) Ee bap) 
Try) = (Tai(y); 9% Pag): 

and 


(33) 
Ty) = X (wy —%) (Fy)), F=1,...p. 


The asymptotic distribution of $n^{1/2}(\hat\beta - \beta)$ then follows from (31), (32) and (33) and from Bunke and Bunke (1986, theorem 2.4.3). □
2.5.3 Adaptive rank estimators


The optimal rank or linearized rank estimates are based on the score function $\varphi(t) = \varphi(t, f)$ and thus they may be determined only if we know the corresponding error density f. More precisely, it suffices to know the type of f only, for the optimum rank tests are invariant with respect to changes of location and scale of f.

However, we usually do not know the type of f either, and it is just this lack of knowledge that stimulates the use of rank tests. Then we either prefer simplicity to efficiency and make use of some common rank test (e.g. the Wilcoxon test), or we may try to estimate f from all or part of the observations.

Since the rank statistics depend on f through $\varphi(t, f)$, we shall try to estimate the latter function. The estimator of $\beta$ will be either the rank or linearized rank estimator based on the estimated $\varphi(t, f)$.

We shall briefly describe three adaptive procedures. They differ in the way of estimating $\varphi(t, f)$.

The first procedure was suggested by Hájek (1970) and consists in selecting
one of k distinct density types $\mathcal{F}_1, \ldots, \mathcal{F}_k$ generated by densities $f_1, \ldots, f_k$,
i Ween Vek on Gn: 


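\[ \mathcal{F}_j = \{f : f(x) = \lambda f_j(\lambda x - u),\ u \in \mathbb{R}^1,\ \lambda > 0\}, \quad j = 1, \ldots, k. \tag{34} \]

Let $y_{n1}, \ldots, y_{nn}$ be independent observations such that $y_{ni} - \beta x_{ni}$ has a density f which belongs to $\bigcup_{j=1}^{k} \mathcal{F}_j$ but otherwise is unknown (i = 1, ..., n).
We shall try to find j such that $f \in \mathcal{F}_j$. The decision procedure $\delta(y_1, \ldots, y_n)$ takes on values in {1, ..., k}. Let L(j, d) be the loss corresponding to $f \in \mathcal{F}_j$ and $\delta(y) = d$:

\[ L(j, d) = \begin{cases} 0 & \text{if } j = d, \\ 1 & \text{if } j \neq d. \end{cases} \tag{35} \]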


Restricting ourselves to the procedures invariant under the group of positive linear transformations, $x \mapsto ax + b$, a > 0, we get the risk of the procedure given $f \in \mathcal{F}_j$,

\[ R(j, \delta) = 1 - P\{\delta(y_1, \ldots, y_n) = j \mid f \in \mathcal{F}_j\}. \tag{36} \]


Let us fix 8 > 0 and compute the ratios 


ty = (3 ms — BF)” TU" ESPs + Ben) — SPMD 


Syn + Ban) = Xi Eni — %y) a(R), Mele (38) 
where $R_{ni}^{\beta}$ is the rank of $y_{ni} + \beta x_{ni}$ among $y_{n1} + \beta x_{n1}, \ldots, y_{nn} + \beta x_{nn}$, i = 1, ..., n,
and


a) =o(— i] i= dete ats j= 1, +k. (39) 


The 1,; are invariant under positive linear transformations and it follows from 
Theorem 2.4.4 that 


iim n Pa Ha ae re z)| (hs)? Boia] = ‘| ai) (40) 
Thee 


where dP, = f;, dy and 
1 
Ojn a | m fr) v(é, fi) a| [I(F;) I(F,)}72, 9, h —— i Bey) ke. 
0 


It follows from (34) and from the Cauchy-Schwarz inequality that 
Op ied fag lor aj cake 9, il eee 


Consequently, the decision procedure 6* 


ISAS 


[o*(y1, sees Yn) = | S [?nilts»-- seey Yn) = a Se LanGa> «> tee ¥o)| (41) 
is consistent in the sense that 
k 
> Rh, 6*) +0, as n—->oo, for any fixed f. 


h=1 


To prove this, we may We in view of (40) for n = m(e) (denoting 


a= | Seu —# — &p | aa 


k k 
R(h, 5*) = ¥ PS*(y) =F |F € Fr) SD P(lnjly) = taal) | f © Fa) 
er jek 


SY Plan( Ita)? Bain + © = an(I(fa))"? Bonn — 
jh 
[l,j = an(Z(fx))¥? Boin| <é, [Un < a,(I(f,))¥? Bo;;| = eb + €; 


and as it follows from (41), the last probabilities are equal to zero for sufficiently 
small «. 

Let us summarize how $\delta^*$ may be applied in estimating. Consider the problem of estimating $\beta_0$ on the basis of observations $y_1, \ldots, y_n$ where $y_i - \beta_0 x_i$ has a density f. Then a proper estimator is a solution of minimization (2) or the linearized rank estimator (27) with the scores $a_n(i)$ corresponding to the density f in the well-known way. If the type of f is not known, we perform the decision procedure $\delta^*$ based on a part of the observations, say $y_1, \ldots, y_{m_n}$, where $m_n \to \infty$, $m_n = O(n)$. We take a certain number of density types and compute for them the quantities $l_{nj}$ applied to $y_1, \ldots, y_{m_n}$ and to any fixed $\beta > 0$. The density type providing the largest $l_{nj}$ is then chosen to generate the scores; the estimator is computed from $y_{m_n+1}, \ldots, y_n$. It need not necessarily be an R-estimator but could be an M-estimator or L-estimator as well.

The procedure then selects one of a finite set $\mathcal{F}$ of distribution shapes. It has undisputed advantages: the shapes may be properly chosen in order to lead to well-known rank tests, etc. The procedure is consistent for any distribution from $\mathcal{F}$.

On the other hand, nothing is known if the true distribution is not contained in $\mathcal{F}$. The following two procedures provide asymptotically efficient estimators for any regular distribution. The first one is due to Hájek (1962) who utilized it to construct a test which is asymptotically optimal for all f with $I(f) < \infty$. Van Eeden (1970) then suggested an estimator of the shift in location derived from Hájek's test in the manner of Hodges and Lehmann. Beran (1974) suggested a Fourier series estimator of $\varphi(t, f)$ and the computation of the linearized rank estimator based on it.

Suppose that $y_{ni}$ is distributed according to density $f_0(y - \beta x_{ni})$, i = 1, ..., n, such that $I(f_0) < \infty$ and that $\varphi(t, f_0)$ is nondecreasing in $t \in (0, 1)$.

Let $\{K_n\}$ be a sequence of integers satisfying

\[ K_n \to \infty, \quad K_n/n \to 0, \tag{42} \]
let {p,} be a sequence of integers satisfying 
KP - po = KP 41 (43) 


and let {0 = hao < haa <-+: < hag, = K,} be a sequence of (g, + 1)-tuples 
of integers, satisfying 


lim max |ha,ji1 — hn,;|/K2% (44) 
no OSj<qn 
=lm min ln i+ a hn || Kn = 1. (45) 


moo 0SjS4n 


Let y < --- < y'X») be the order statistics of ym,.--,Ynx,- Then Hajek’s 
estimator @,(t) of y(t, fo) based on Yn1, ---» Ynx, 18 given by 


ed 
ibe ao) 


role 


1 
Ko 1 a (46) 
te yshinat Pn) — ylins— Pad) ying t Pn) oo ylbasi— Pa) 
n 
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for 
hnj ee a a Na ja 
Ke n—K,+1 °° &#&424&, 


> 


Piao ens Ore as areca Eatery, weaeaay «GAP 
the definition is completed by taking ¢,(¢) constant on the intervals 


( i—1 i | i= 1,2,...,n — K,; and G,(t) = 0 otherwise. 


ee on 
Then $\hat\varphi_n(t)$ is a consistent estimator of $\varphi(t, f_0)$ in the sense that

\[ \int_0^1 \left(\hat\varphi_n(t) - \varphi(t, f_0)\right)^2 dt \xrightarrow{P_0} 0, \quad \text{as } n \to \infty, \tag{47} \]

where $P_0$ corresponds to $\beta = 0$. (For the proof of (47), we refer to Hájek and Šidák, 1967; see also [A 2.15].) It follows from [A 2.2] and [A 2.5] that (47) holds also under any sequence of alternatives contiguous with respect to $\prod_{i=1}^{n} f_0(y_i)$, and hence it holds also under $\beta \neq 0$ and $x_{ni}$ satisfying $\sum_{i=1}^{n} x_{ni} = 0$ (n = 1, 2, ...) and

\[ \lim_{n \to \infty} \max_{1 \le i \le n} x_{ni}^2 \left(\sum_{j=1}^{n} x_{nj}^2\right)^{-1} = 0. \]

Van Eeden (1970) shows how to modify $\hat\varphi_n(t)$ in order to make it nondecreasing, constant on equidistant intervals, and still consistent. The resulting estimator of $\beta$, asymptotically efficient, is then either the R-estimator or the linearized rank estimator corresponding to the statistic


e- . is R=? 

— 5. Ce x Dn at 48 

| % i Bas K,, se ) 

where $R_i^{\beta}$ is the rank of $(y_{ni} - \beta x_{ni})$ among $(y_{n,K_n+1} - \beta x_{n,K_n+1}, \ldots, y_{nn} - \beta x_{nn})$.

For a detailed explanation we refer to van Eeden (1970) (location model) or to Dionne (1981) (linear regression model).

Another possibility is a Fourier series estimator of $\varphi(t, f_0)$. If $I(f_0) < \infty$ then $\varphi(t, f_0)$ has the Fourier expansion

\[ \varphi(t, f_0) = \sum_{|k|=1}^{\infty} d_k \exp(2\pi i k t) \tag{49} \]

where

\[ d_k = \int_0^1 \varphi(t, f_0) \exp(-2\pi i k t)\,dt. \tag{50} \]

If $\hat d_{kn}$ is a proper estimator of $d_k$ based on $y_{n1}, \ldots, y_{nn}$, then a plausible estimate for $\varphi(t, f_0)$ is

\[ \hat\varphi_n(t) = \sum_{|k|=1}^{m_n} \hat d_{kn} \exp(2\pi i k t) \tag{51} \]

where $m_n \to \infty$ at a suitable rate as $n \to \infty$.
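
The expansion (49)-(51) can be illustrated on a known score function. The sketch below is not an estimator from data: it takes the score $\varphi(t) = 2t - 1$ as an assumed example, computes the coefficients $d_k$ of (50) by a midpoint rule, and evaluates the truncated sum of (51) with the exact coefficients in place of estimates. All names and the truncation point are hypothetical choices.

```python
import cmath
import math


def score(t):
    """Illustrative score function phi(t) = 2t - 1 (an assumption of this sketch)."""
    return 2.0 * t - 1.0


def fourier_coefficient(k, points=2000):
    """Midpoint-rule approximation of d_k = int_0^1 phi(t) exp(-2 pi i k t) dt."""
    h = 1.0 / points
    total = 0j
    for j in range(points):
        t = (j + 0.5) * h
        total += score(t) * cmath.exp(-2j * math.pi * k * t)
    return total * h


def truncated_expansion(t, coeffs, m):
    """Partial sum over 1 <= |k| <= m of d_k exp(2 pi i k t), as in (51)."""
    total = 0j
    for k in range(1, m + 1):
        for signed_k in (k, -k):
            total += coeffs[signed_k] * cmath.exp(2j * math.pi * signed_k * t)
    return total.real


m = 50  # truncation point, playing the role of m_n
coeffs = {k: fourier_coefficient(k) for k in range(-m, m + 1) if k != 0}

for t in (0.1, 0.25, 0.4):
    print(t, score(t), round(truncated_expansion(t, coeffs, m), 3))
```

For this particular score the coefficients are $d_k = i/(\pi k)$, so the partial sums converge slowly (like 1/m) away from the endpoints; this slow convergence is one reason the rate of $m_n$ matters in the consistency statement.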
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One possibility of estimating d, is as follows: Vx(9) = exp (—2z ikt), |k| 
= 1, 2,..., let 6,, 6, be real numbers such that 0 = 107 = 05, Ie Then 


(k) k 
== S;, a Si 


kn és as 5, (52) 
is an estimator of d;, consistent in probability for all 4, where 
n Ré 
Soe (Qo a 4 5 
6 x n ) Pk hee 1 ( 3) 


and R? is the rank of yn; + 5%q; iM (Yn, + Opi, --+> Yan + O%nn)) i= 1,..., 0. 
An analogous estimator based on identically distributed observations was 
studied by Beran (1974). If {m,} is a sequence such that 


M,—>co and m7/n?>0, as n—> oo (54) 


then $\hat\varphi_n(t)$ of (51) is an estimator of $\varphi(t, f_0)$ consistent in the sense of (47). This fact is proved in Beran (1974). The resulting asymptotically efficient estimator of $\beta$ is then the linearized rank estimator based on the statistics


Seen (nan Sr ee 55 
a XLni — Ly n ae 
8 )g (- ra ) (55) 
where $R_{ni}^{\beta}$ is the rank of $y_{ni} - \beta x_{ni}$ in $(y_{n1} - \beta x_{n1}, \ldots, y_{nn} - \beta x_{nn})$.

A similar approach based on the Fourier series with respect to the Legendre polynomials was studied by Hušková (1983).


2.6 Asymptotic comparison of different estimation procedures 


We have mentioned several times that there are close relationships between different robust estimation procedures. For instance, if the underlying distribution is known and smooth, all procedures provide asymptotically efficient estimators. All respective estimators are asymptotically equivalent if the relationships of the corresponding $\psi$, $\varphi$, and J functions are such as described in Remark 2.3.4.

The present section is devoted to the mathematical background of some of these relationships.


2.6.1 Asymptotic distribution of the difference of M- and R-estimators 


Suppose that the error distribution F and the design matrix $X_n$ of the model (2.2.1) and the functions $\varphi(t)$, 0 < t < 1, and $\psi(x)$, $x \in \mathbb{R}^1$, satisfy the assumptions (A 8), (A 9), (A 10), and (A 3).

Let w, y and «? be given by (2.3.36), (2.5.7a) and (2.5.7B), respectively. 


192 Chapter 2. Robust statistical inference in linear models 


Moreover, put

\[ \bar\psi = \int \psi(x)\,dF(x), \qquad \bar\varphi = \int_0^1 \varphi(t)\,dt. \tag{1} \]



Let $\hat\beta_R$ be the R-estimator corresponding to the function $\varphi$, i.e. $\hat\beta_R$ is any solution of the minimization (2.5.2). Let $\hat\beta_M$ denote the M-estimator corresponding to the function $\psi$, i.e. $\hat\beta_M$ is a solution of the system of equations (2.2.9). The asymptotic relation between $\hat\beta_R$ and $\hat\beta_M$ is expressed in the following theorem.


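Theorem 2.6.1 Under the assumptions (A2), (A3), (A8), (A10) and for $\gamma \neq 0$, $\omega \neq 0$, the asymptotic distribution of the sequence $\{n^{1/2}(\hat\beta_M - \hat\beta_R)\}_{n=1}^{\infty}$ is p-dimensional normal with expectation 0 and with the covariance matrix

\[ \int_0^1 \left[\frac{1}{\omega}\left(\psi(F^{-1}(t)) - \bar\psi\right) - \frac{1}{\gamma}\left(\varphi(t) - \bar\varphi\right)\right]^2 dt \;\Sigma^{-1}. \tag{2} \]
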
Proof. It follows from (2.3.53) and from Lemma 2.5.2 that 
nil2(Bye — Bp) ~ mtREA [= mong) — + sty — xp) (3) 
@ ue 


where M()(p°) = (I\(), ..., M(6°)) is given by (2.3.39) and f° is the 
true parameter value. Moreover (2.4.34) implies 


nUAS (y — Xp) wn, = nT, ..., TP) (t 
Tie = Lew oF (6:(6°)), 7 eat eee p- (5) 


nl( Bye — By) wn RE AX’ fe M80) — — of (Ue) (6) 
v(5(B)) = (v(01(8)), ---> v(3n(B))) 
9 F(8(B))) = (o(F (806), ---» o(F(8n(8))))- 


The rest of the proof follows easily from Bunke and Bunke (1986, theorem 2.4.3). □


Theorem 2.6.1 has several corollaries which have an interest of their own.

Corollary 2.6.1 $n^{1/2}\hat\beta_M \sim n^{1/2}\hat\beta_R$ if and only if

\[ \varphi(t) = a\,\psi(F^{-1}(t)) + b \quad \text{a.e. in } (0, 1) \tag{7} \]

for a > 0, $b \in \mathbb{R}^1$.
Put

\[ \psi(x) = -\frac{g'(x)}{g(x)}, \quad x \in \mathbb{R}^1, \tag{8} \]

where g is a density such that $\psi$ satisfies (A3) (for instance, g may be any unimodal density with $I(g) < \infty$). Then $\hat\beta_M$ is the maximum likelihood estimator corresponding to g. Similarly, put

\[ \varphi(t) = \varphi(t, g), \quad 0 < t < 1, \tag{9} \]

so that $\hat\beta_R$ is the R-estimator, asymptotically efficient in the case f = g. The asymptotic distribution of $n^{1/2}(\hat\beta_M - \hat\beta_R)$ is then normal with the expectation 0 and with the covariance matrix


{ : — ) aE 2 (10) 
1 ATO tea 
Donlie e\ ee) ae 


Under (8) and (9), we have the following corollaries: 


Corollary 2.6.2 Let $\psi$ and $\varphi$ satisfy (8) and (9), respectively, with g being the density of the normal distribution $N(0, \sigma^2)$, $\sigma^2 > 0$. Then $n^{1/2}\hat\beta_M \sim n^{1/2}\hat\beta_R$ if and only if f is normal $N(\mu, \lambda^2)$ with some $\mu \in \mathbb{R}^1$, $\lambda > 0$.


Corollary 2.6.3 Let $\psi$ and $\varphi$ satisfy (8) and (9), respectively, with g being the logistic density. Then, under the assumption of symmetry of f, $n^{1/2}\hat\beta_M \sim n^{1/2}\hat\beta_R$ if and only if f = g.

Remark 2.6.1 Let $\psi$ and $\varphi$ satisfy (8) and (9), respectively, with g being the density of the double exponential distribution. Then $n^{1/2}\hat\beta_M \sim n^{1/2}\hat\beta_R$ for any symmetric error distribution f.
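
Remark 2.6.1 can be checked directly against condition (7). For the double exponential density $g(x) = \frac{1}{2}e^{-|x|}$ one has $-g'(x)/g(x) = \operatorname{sign}(x)$, and the corresponding score is $\operatorname{sign}(t - 1/2)$; the sketch below verifies on a grid that $\varphi(t) = \psi(F^{-1}(t))$ (that is, (7) with a = 1, b = 0) for several symmetric F. The choice of distributions and the grid are assumptions made only for illustration.

```python
import math
from statistics import NormalDist


def sign(value):
    return (value > 0) - (value < 0)


def psi(x):
    """-g'(x)/g(x) for the double exponential density g(x) = exp(-|x|)/2."""
    return sign(x)


def phi(t):
    """Score phi(t, g) = psi(G^{-1}(t)) for the same g."""
    return sign(t - 0.5)


# Quantile functions F^{-1} of three distributions symmetric about 0.
quantiles = {
    "normal": NormalDist().inv_cdf,
    "logistic": lambda t: math.log(t / (1.0 - t)),
    "cauchy": lambda t: math.tan(math.pi * (t - 0.5)),
}

grid = [k / 200 for k in range(1, 200) if k != 100]  # skip t = 1/2
results = {
    name: all(phi(t) == psi(quantile(t)) for t in grid)
    for name, quantile in quantiles.items()
}
print(results)
```

The check succeeds because a distribution symmetric about 0 has $F^{-1}(t) > 0$ exactly when $t > 1/2$; an asymmetric F would break the identity.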


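Corollary 2.6.4 Let $\hat\beta_1$ be the least squares estimator, i.e. $\hat\beta_1 = \hat\beta_M$ with $\psi(x) = x$, $x \in \mathbb{R}^1$; let $\hat\beta_R$ correspond to a function $\varphi(t)$, $t \in (0, 1)$. Then $n^{1/2}\hat\beta_1 \sim n^{1/2}\hat\beta_R$ if and only if

\[ \varphi(t) = aF^{-1}(t) + b \quad \text{for } a > 0,\ b \in \mathbb{R}^1. \tag{11} \]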


2.6.2 Asymptotic distribution of the difference 
of linearized rank estimator and R-estimator 


Let $\hat\beta_0$ be the linearized rank estimator of $\beta$, defined in (2.5.28), with the least squares estimator $\hat\beta_1$ in the role of the initial estimator. Let $\hat\beta_R$ be the rank estimator (both $\hat\beta_0$ and $\hat\beta_R$ are supposed to correspond to the score-generating function $\varphi$).

Theorem 2.6.2 Suppose that the assumptions (A2), (A3), (A8), (A10) are satisfied. The asymptotic distribution of $n^{1/2}(\hat\beta_0 - \hat\beta_R)$ is then p-dimensional normal with the expectation 0 and the covariance matrix


(1 2 a aaa ies fi g(t) — ¢) F-1(t) at] 2 (12) 
y 


where $\sigma^2 = \int x^2\,dF(x) - \left(\int x\,dF(x)\right)^2$.
Proof. It follows from Lemma 2.5.1 that
n-VPZ-US(y — XBx) — Sy — XB,)] ~ —yn"(Br — Br); (13) 
and (2.5.2) and (2.5.28) imply 
n2(Bp — Ba) 
A A 1 A A 
~ n2(BR — By) + aE n?X-US8,(y — XBr) — Say —XPi)J- (14) 


Combining (13) and (14), we get 
n2(Bp to B2) as ( = =, n2(Bp oi B,). 


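The rest of the proof then follows from Theorem 2.6.1. □


Corollary 2.6.5 $n^{1/2}\hat\beta_0 \sim n^{1/2}\hat\beta_R$ if and only if either $\varphi(t) - \bar\varphi = a\,\varphi(t, f)$, 0 < t < 1, a > 0, or $\varphi(t) = aF^{-1}(t) + b$, 0 < t < 1; a > 0, $b \in \mathbb{R}^1$.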


2.7 Confidence intervals for regression coefficients
based on ranks


Let $y_1, \ldots, y_n$ be independent observations such that $y_i$ has the distribution function $F(y - \beta x_i)$, i = 1, ..., n, where F is assumed to be continuous but otherwise unknown. A family $\mathcal{B}(y)$ of confidence sets for $\beta$ at the confidence level $(1 - \alpha)$ may be based on the rank tests of hypotheses $H(\beta^0)$: $\beta = \beta^0$ in the following way. If for each $\beta^0 \in \mathbb{R}^1$, $\mathcal{A}(\beta^0)$ is the acceptance region of an $\alpha$-test for testing $H(\beta^0)$, then

\[ \mathcal{B}(y) = \{\beta \in \mathbb{R}^1 : y \in \mathcal{A}(\beta)\}. \tag{1} \]

For small and moderate values of n, the acceptance regions $\mathcal{A}(\beta^0)$ can be found from tables of the null distribution of rank statistics. More specifically, let $S_n(y - \beta x)$ be the simple linear rank statistic

\[ S_n(y - \beta x) = \sum_{i=1}^{n} x_i\, a_n(R_i^{\beta}) \tag{2} \]

where $R_i^{\beta}$ is the rank of $y_i - \beta x_i$ among $(y_1 - \beta x_1, \ldots, y_n - \beta x_n)$ and the scores $a_n(i)$, i = 1, ..., n, are generated by a function $\varphi$, nondecreasing and square-integrable on (0, 1), either by (2.4.22) or by (2.4.23). Suppose that the two-sided $\alpha$-test of the hypothesis $H(\beta^0)$: $\beta = \beta^0$ accepts $H(\beta^0)$ when

\[ C_n^{(1)} \le S_n(y - \beta^0 x) \le C_n^{(2)}. \tag{3} \]

If either $a_n(i) + a_n(n - i + 1) = \text{const.}$ or $x_i + x_{n-i+1} = \text{const.}$, i = 1, ..., n, then the distribution of $S_n(y - \beta x)$ is symmetric and $C_n^{(1)} = 2\bar{x}_n \sum_{i=1}^{n} a_n(i) - C_n^{(2)}$.
Noting the fact that $S_n(y - \beta x)$ is a nonincreasing function of $\beta$ with probability 1 (cf. [A 3.8]), we get that the corresponding confidence region is an interval

\[ \underline{\beta}(y) \le \beta \le \overline{\beta}(y) \tag{4} \]

where

\[ \underline{\beta} = \sup\{\beta : S_n(y - \beta x) > C_n^{(2)}\} \tag{5} \]

and

\[ \overline{\beta} = \inf\{\beta : S_n(y - \beta x) < C_n^{(1)}\}. \tag{6} \]

The probability of (4) is independent of both $\beta$ and F provided F is absolutely continuous. However, for given sample size n, the constants $C_n^{(1)}$ and $C_n^{(2)}$ for which (4) has exactly probability $(1 - \alpha)$ may not exist. To avoid randomization, one can prefer an $\alpha$ for which such values exist. For large sample sizes, it is enough that the constants $C_{n,\alpha}^{(1)}$, $C_{n,\alpha}^{(2)}$ are chosen in such a way that the probability $(1 - \alpha(n))$ of (4) tends to the specified value $(1 - \alpha)$ as n tends to infinity.

The asymptotic relative efficiency of two confidence procedures is usually measured by the limit of the ratios of the sample sizes necessary for attaining the same probabilities of covering the false parameter value. In such case, the confidence intervals based on the asymptotic null distribution of the rank statistics have the asymptotic efficiencies relative to the standard confidence intervals equal to the Pitman asymptotic relative efficiencies of the corresponding rank and standard tests.

Alternatively, the efficiency might be measured in terms of lengths of the intervals. It will be shown in Section 2.7.1 that the ratio of the squares of lengths $(L_n^*)^2/L_n^2$ of the standard and the rank confidence intervals, respectively, tends in probability to the relative asymptotic efficiency of both procedures. Moreover, a multiple of $L_n^2$ is shown to be a consistent estimator of the asymptotic variance of the corresponding R-estimator of $\beta$.

The length $L_n$ of the confidence interval is a random variable and cannot be bounded unless restrictions are placed on F, since $L_n$ tends in probability to infinity as F becomes more and more spread out. Confidence intervals of length not exceeding a given number can however be obtained by taking observations $X_1, X_2, \ldots$ sequentially.

Stein (1956) first suggested a two-stage procedure for obtaining a bounded-length confidence interval in the case of a normal population. Later Chow and Robbins (1965) proposed a sequential procedure for the mean of a population with finite variance; their procedure was extended by Gleser (1965) to the linear regression model. An analogous sequential procedure based on ranks was investigated by Geertsema (1970). Further work on this problem concerning the linear regression model is due to Ghosh and Sen (1972). We shall consider this problem in Section 2.7.2.


2.7.1 Asymptotic efficiency of rank confidence intervals 


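Let $y_{n1}, \ldots, y_{nn}$ be independent observations such that $y_{ni}$ is distributed according to the distribution function $F(y - \beta^0 x_{ni})$, i = 1, ..., n; suppose that $I(F) < \infty$.

Denote

\[ a_n^2 = \sum_{i=1}^{n} (x_{ni} - \bar{x}_n)^2, \qquad \bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_{ni} \tag{7} \]

and suppose that Noether's condition is satisfied, i.e.

\[ \lim_{n \to \infty} \max_{1 \le i \le n} \left|\frac{x_{ni} - \bar{x}_n}{a_n}\right| = 0. \tag{8} \]

Consider the statistics

\[ S_n(y - \beta x) = \sum_{i=1}^{n} (x_{ni} - \bar{x}_n)\, a_n(R_{ni}^{\beta}) \tag{9} \]

where $R_{ni}^{\beta}$ is the rank of $y_{ni} - \beta x_{ni}$ in $(y_{n1} - \beta x_{n1}, \ldots, y_{nn} - \beta x_{nn})$, and the scores $a_n(i)$, i = 1, ..., n, are generated by a function $\varphi$, nondecreasing and square-integrable on (0, 1), either by (2.4.22) or by (2.4.23).

Introduce the confidence set

\[ \mathcal{B}_n(y) = \{\beta \in \mathbb{R}^1 : |S_n(y - \beta x)| \le a_n K_\alpha\} \tag{10} \]

where

\[ K_\alpha = A\,\Phi^{-1}\left(1 - \frac{\alpha}{2}\right), \quad 0 < \alpha < 1, \tag{11} \]

\[ A^2 = \int_0^1 (\varphi(t) - \bar\varphi)^2\,dt, \qquad \bar\varphi = \int_0^1 \varphi(t)\,dt \tag{12} \]

and $\Phi^{-1}$ is the inverse standard normal distribution function.
The following lemma states that the limiting probability of covering the true value by $\mathcal{B}_n(y)$ is $(1 - \alpha)$.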


Lemma 2.7.1 Under (7)-(12) and for any $\alpha \in (0, 1)$,

\[ \lim_{n \to \infty} P_{\beta^0}\{\beta^0 \in \mathcal{B}_n(y)\} = 1 - \alpha. \tag{13} \]

Proof.

\[ P_{\beta^0}\{\beta^0 \in \mathcal{B}_n(y)\} = P_{\beta^0}\{|S_n(y - \beta^0 x)| \le a_n K_\alpha\} = P_0\{|S_n(y)| \le a_n K_\alpha\} \to 2\Phi(K_\alpha/A) - 1 = 1 - \alpha \]

as $n \to \infty$, where the convergence follows from Theorem 2.4.3. □


In view of the monotonicity of $S_n(y - \beta x)$ with respect to $\beta$ we can write

\[ \mathcal{B}_n(y) = (B_n^-(y), B_n^+(y)) \tag{14} \]

where

\[ B_n^- := B_n^-(y) = \sup\{\beta \mid S_n(y - \beta x) > a_n K_\alpha\} \tag{15} \]

and

\[ B_n^+ := B_n^+(y) = \inf\{\beta \mid S_n(y - \beta x) < -a_n K_\alpha\}. \tag{16} \]

Denote

\[ L_n = B_n^+ - B_n^- \tag{17} \]

the length of the confidence interval (14). The following theorem shows that $a_n^2 L_n^2$ is a consistent estimator of a multiple of the asymptotic variance of the R-estimator of $\beta$ based on $S_n$.
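
The construction (10), (15)-(17) can be sketched in code. The example assumes Wilcoxon-type scores $a_n(i) = i/(n+1)$, for which $A^2 = 1/12$; the generating rules (2.4.22) and (2.4.23) are not reproduced here, so this score choice, the function names, and the small data set are assumptions of the sketch. Since $S_n(y - \beta x)$ is a step function of $\beta$ that can only jump at pairwise slopes $(y_j - y_i)/(x_j - x_i)$, it suffices to evaluate it once between consecutive slopes.

```python
import math
from statistics import NormalDist


def rank_statistic(beta, x, y):
    """S_n(y - beta*x) = sum_i (x_i - xbar) * R_i / (n + 1)."""
    n = len(x)
    xbar = sum(x) / n
    resid = [yi - beta * xi for xi, yi in zip(x, y)]
    order = sorted(range(n), key=lambda i: resid[i])
    ranks = [0] * n
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return sum((x[i] - xbar) * ranks[i] / (n + 1) for i in range(n))


def rank_confidence_interval(x, y, alpha=0.05):
    """Return (B_minus, B_plus) following (15) and (16) with A^2 = 1/12."""
    n = len(x)
    xbar = sum(x) / n
    a_n = math.sqrt(sum((xi - xbar) ** 2 for xi in x))
    cutoff = a_n * NormalDist().inv_cdf(1.0 - alpha / 2.0) / math.sqrt(12.0)

    # S_n only changes where two residuals swap order, i.e. at pairwise slopes.
    slopes = sorted({(y[j] - y[i]) / (x[j] - x[i])
                     for i in range(n) for j in range(i + 1, n)
                     if x[i] != x[j]})
    # One probe point inside every interval on which S_n is constant.
    probes = [slopes[0] - 1.0]
    probes += [(lo + hi) / 2.0 for lo, hi in zip(slopes, slopes[1:])]
    probes += [slopes[-1] + 1.0]
    values = [rank_statistic(b, x, y) for b in probes]

    b_minus, b_plus = -math.inf, math.inf
    for k, value in enumerate(values):
        if value > cutoff and k < len(slopes):
            b_minus = slopes[k]          # right end of the k-th interval
    for k in range(len(values) - 1, -1, -1):
        if values[k] < -cutoff and k > 0:
            b_plus = slopes[k - 1]       # left end of the k-th interval
    return b_minus, b_plus


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
b_minus, b_plus = rank_confidence_interval(x, y)
print(b_minus, b_plus, b_plus - b_minus)  # interval ends and length L_n
```

For very small n the set $\{\beta : S_n > a_n K_\alpha\}$ can be empty, in which case the sketch returns an infinite endpoint rather than a finite interval.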


Theorem 2.7.1 Under (7)-(12) and (14)-(17),

\[ a_n L_n \to 2\,\Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \frac{A}{\gamma} \tag{18} \]

in probability under the hypothesis $\beta = \beta^0$, as $n \to \infty$, with $\gamma$ defined in (2.5.7).
Proof. a,(B;, — f°) and a,(B; — f°) are asymptotically normal, 
2 
N ( A @-1 ( ae = = (19) 
y a a a 


respectively, and thus they are asymptotically bounded in probability. Actu- 
ally, we have 


t 
lim Pp{a,(B, — B°) > t} = lim Pz {* (y _ S — ) ) oe uk] 


== hi ede Day ae > nk} =1 — of 2 (1+) 
n—>oo : an J a y 


(where we have used [A 2.13)). 
We may proceed analogously concerning B,. 
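Lemma 2.5.3 then implies that

\[ \lim_{n \to \infty} P_{\beta^0}\{|S_n(y - B_n^- x) - S_n(y - \beta^0 x) + a_n^2 (B_n^- - \beta^0)\gamma| \ge \varepsilon a_n\} = 0 \tag{20} \]

and

\[ \lim_{n \to \infty} P_{\beta^0}\{|S_n(y - B_n^+ x) - S_n(y - \beta^0 x) + a_n^2 (B_n^+ - \beta^0)\gamma| \ge \varepsilon a_n\} = 0 \tag{21} \]

hold for any $\varepsilon > 0$. (15), (16), (20), and (21) then imply that

\[ \lim_{n \to \infty} P_{\beta^0}\{|a_n (B_n^+ - B_n^-)\gamma - 2K_\alpha| \ge \varepsilon\} = 0. \]
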


Suppose now that > Xn; = 0. The problem is that of asymptotic efficiency 
i=1 
of the confidence interval (14) with respect to the standard confidence interval 


[21 Se | (22) 


1 4 ; 
where S? = Tene x (y; — xB)? and f, is the least squares estimator. 
x. i=1 


Following Lehmann (1963), we shall measure the efficiency in terms of the probability of covering false values, more precisely, in terms of the probability that the intervals cover the value $\beta^0 + \delta n^{-1/2}$. Then it follows from the relation between the confidence intervals and the tests on which they are based and from the asymptotic properties of the rank tests (cf. [A 2.13]) that the intervals (14) which are based on n observations and the intervals (22) based on n' observations will have the same asymptotic probabilities of covering the values $\beta^0 + \delta n^{-1/2}$ as $n \to \infty$, provided

\[ \frac{n'}{n} \to \sigma^2 \gamma^2 A^{-2} \quad \text{as } n \to \infty \tag{23} \]

where $\sigma^2$ is the variance of F. In this sense, the right-hand side of (23) is the relative asymptotic efficiency of the two sets of intervals.
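
As a worked instance of the right-hand side of (23), take Wilcoxon scores $\varphi(t) = t$. The derivation below assumes that $\gamma$ of (2.5.7), which is not reproduced in this section, has the usual form $\gamma = \int_0^1 \varphi(t)\,\varphi(t, f)\,dt$ with $\varphi(t, f) = -f'(F^{-1}(t))/f(F^{-1}(t))$; under that assumption the efficiency reduces to the familiar expression and gives the normal entry of Table 2.5.1.

```latex
\begin{align*}
A^2 &= \int_0^1 \left(t - \tfrac{1}{2}\right)^2 dt = \frac{1}{12}, \\
\gamma &= -\int_{-\infty}^{\infty} F(x)\, f'(x)\, dx
        = \int_{-\infty}^{\infty} f^2(x)\, dx
        \quad \text{(integration by parts)}, \\
\sigma^2 \gamma^2 A^{-2} &= 12\, \sigma^2 \left( \int_{-\infty}^{\infty} f^2(x)\, dx \right)^2 . \\
\intertext{For the normal density with variance $\sigma^2$, $\int f^2(x)\,dx = 1/(2\sigma\sqrt{\pi})$, so}
\sigma^2 \gamma^2 A^{-2} &= \frac{12\,\sigma^2}{4\pi\sigma^2} = \frac{3}{\pi} \approx 0.955 .
\end{align*}
```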

Alternatively, the efficiency might be measured in terms of the lengths of the intervals. Let $L_n^*$ denote the length of the interval in (22). Then it follows from Theorem 2.7.1 and from (22) that

\[ \frac{(L_n^*)^2}{L_n^2} \to \sigma^2 \gamma^2 A^{-2} \tag{24} \]

in probability as $n \to \infty$ under the hypothesis $\beta = \beta^0$. If the intervals (14) are based on n and the intervals (22) on n' observations, respectively, the ratio $L_{n'}^*/L_n$ will tend in probability to one, provided (23) holds. Thus the right-hand side of (23) is also a reasonable measure of efficiency when the comparison is made in terms of the length of intervals.
2.7.2 Bounded length confidence interval based on the Wilcoxon test


Let $y_1, y_2, \ldots$ be independent observations such that $y_n$ is distributed according to the distribution $F(y - \beta x_n)$, n = 1, 2, ...; suppose that $I(F) < \infty$.
We want to determine a confidence interval $I_n = \{\beta \mid B_n^- < \beta < B_n^+\}$ such that

\[ P_\beta(\beta \in I_n) = 1 - \alpha \tag{25} \]

and

\[ 0 < L_n = B_n^+ - B_n^- \le 2d \tag{26} \]

for some given d (> 0).

If F is not known, no fixed-sample procedure is available which guarantees (26) for all F from a large class (say, that of all absolutely continuous distribution functions), since $L_n$ tends to infinity in probability as F becomes more and more spread out. Confidence intervals for $\beta$ of length not exceeding 2d can, however, be obtained by taking the observations $y_1, y_2, \ldots$ sequentially as follows. Having observed $y_1, \ldots, y_n$, calculate $(B_n^-, B_n^+)$ of (15) and (16) for n = 1, 2, ..., and continue taking observations until $L_n \le 2d$; the first integer for which this is the case we shall denote by N(d). Moreover, denote


\[ x_{(n)} = (x_1, \ldots, x_n) \tag{27} \]

and

\[ a_n^2 = \sum_{i=1}^{n} (x_i - \bar{x}_n)^2, \qquad \bar{x}_n = \frac{1}{n} \sum_{j=1}^{n} x_j. \tag{28} \]

The following assumptions are imposed on $x_{(n)}$:
(A12) max |x};| = a,' max |x; — %,| = O(n-1?), 
1Sisn 1Sisn 
(A13) lim n—a2 = K, > 0. 


(A14) Put $Q(a) := (n + 1 - a)\,a_n^2 + (a - n)\,a_{n+1}^2$ if $n \le a \le n + 1$, n = 0, 1, ..., where we set $a_0^2 = 0$. We assume that Q(a) is nondecreasing in a and that

\[ \lim_{n \to \infty} \frac{Q(n b_n)}{Q(n)} = s(b) \quad \text{whenever } \lim_{n \to \infty} b_n = b, \tag{29} \]

s(b) being strictly increasing with s(1) = 1.


The assumptions (A12)—(A14) represent conditions on the trend of the 
coefficients x,, X2, ..., and, by Ghosh and Sen (1972), they are satisfied in the 
majority of practical situations. For example, they are satisfied in the two- 


sample situation (t;,—=0, %,=1, 1=1,2,... with |a;,| = (2n)-2/2, 
a 

Get aoe (n = 1,2,...); Q(2) = —), forza, =a+th,h>0 

; 2, 

G@a= 125 7..), Ole. 
Let $R_i^{\beta} = \sum_{j=1}^{n} u(y_i - y_j - (x_i - x_j)\beta)$ be the rank of $y_i - \beta x_i$ among $y_1 - \beta x_1, \ldots, y_n - \beta x_n$; $\beta \in \mathbb{R}^1$. Consider the Wilcoxon rank statistic based on $y_1, \ldots, y_n$:
S.ly — Bx) = D (ei —%) Rh, (30) 


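and the confidence region

\[ I_n(y) = \left\{\beta \in \mathbb{R}^1 : |S_n(y - \beta x)| \le \frac{1}{\sqrt{12}}\, a_n\, \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right\}. \tag{31} \]

Then $I_n(y) = (B_n^-, B_n^+)$ with $B_n^-$, $B_n^+$ defined as in (15) and (16); it follows from Lemma 2.7.1 that

\[ \lim_{n \to \infty} P_{\beta^0}\{\beta^0 \in I_n(y)\} = 1 - \alpha. \]

Define the stopping variable to be the first integer $N(d) \ge n_0$ for which $L_{N(d)} = B_{N(d)}^+ - B_{N(d)}^- \le 2d$, where $n_0$ is a positive integer. Consider the interval

\[ I_{N(d)} = \{\beta : B_{N(d)}^- < \beta < B_{N(d)}^+\}. \tag{32} \]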
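
The stopping rule just defined can be sketched as a loop over growing samples. The code assumes the Wilcoxon statistic is normalized by n + 1 (so that the cutoff $a_n \Phi^{-1}(1 - \alpha/2)/\sqrt{12}$ of (31) applies); this normalization, the function names, and the noise-free test sequence are assumptions of the sketch rather than statements of the text.

```python
import math
from statistics import NormalDist


def wilcoxon_interval(x, y, alpha=0.05):
    """(B_minus, B_plus) of (15)-(16) for scores i/(n+1); may be infinite."""
    n = len(x)
    xbar = sum(x) / n
    a_n = math.sqrt(sum((xi - xbar) ** 2 for xi in x))
    cutoff = a_n * NormalDist().inv_cdf(1.0 - alpha / 2.0) / math.sqrt(12.0)

    def statistic(beta):
        resid = [yi - beta * xi for xi, yi in zip(x, y)]
        order = sorted(range(n), key=lambda i: resid[i])
        return sum((x[i] - xbar) * (pos + 1) / (n + 1)
                   for pos, i in enumerate(order))

    slopes = sorted({(y[j] - y[i]) / (x[j] - x[i])
                     for i in range(n) for j in range(i + 1, n)
                     if x[i] != x[j]})
    if not slopes:
        return -math.inf, math.inf
    probes = ([slopes[0] - 1.0]
              + [(lo + hi) / 2.0 for lo, hi in zip(slopes, slopes[1:])]
              + [slopes[-1] + 1.0])
    values = [statistic(b) for b in probes]

    b_minus, b_plus = -math.inf, math.inf
    for k, value in enumerate(values):
        if value > cutoff and k < len(slopes):
            b_minus = slopes[k]
    for k in range(len(values) - 1, -1, -1):
        if values[k] < -cutoff and k > 0:
            b_plus = slopes[k - 1]
    return b_minus, b_plus


def stopping_time(xs, ys, d, alpha=0.05, n0=2):
    """First n >= n0 with L_n <= 2d, together with the interval at that n."""
    for n in range(n0, len(xs) + 1):
        lo, hi = wilcoxon_interval(xs[:n], ys[:n], alpha)
        if hi - lo <= 2.0 * d:
            return n, (lo, hi)
    return None, None  # the available observations were exhausted


xs = [float(i) for i in range(1, 21)]
ys = [2.0 * v for v in xs]          # noise-free sequence with slope 2
print(stopping_time(xs, ys, d=0.1))
```

On this noise-free sequence the interval is unbounded until the largest attainable value of the statistic exceeds the cutoff, after which it collapses to the single slope; with noisy data the length shrinks gradually and N(d) grows as d decreases, in line with parts (i) and (iii) of the theorem that follows.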
Having defined a sequential procedure in the above way, two questions 


immediately arise: 

(a) What is the behaviour of N(d)? 

(b) What is the coverage probability of the procedure? 

These questions can, under certain assumptions, be answered asymptotically 
as d -> 0; the problem is still open for fixed d > 0. 

Theorem 2.7.2 Under the assumptions made above,

(i) N(d) is a nonincreasing function of d (> 0);

(ii) N(d) is finite for all d > 0 with probability 1;

(iii) $\lim_{d \to 0} N(d) = \infty$ with probability 1;

(iv) $\lim_{d \to 0} P_\beta\{\beta \in I_{N(d)}\} = 1 - \alpha$.

Proof. (i) The monotonicity follows directly from the definition of N(d).


(ii) For any fixed d > 0,

\[ P_\beta(N(d) = \infty) = P_\beta\left(\bigcap_{n=1}^{\infty} \{N(d) > n\}\right) = \lim_{n \to \infty} P_\beta(N(d) > n) \le \lim_{n \to \infty} P_\beta(L_n > 2d) = \lim_{n \to \infty} P_\beta(a_n L_n > 2 a_n d) = 0 \]

by (19) and assumption (A13).

(iii) $\lim_{d \to 0} N(d) = \infty$ with probability 1 if and only if


K>0 d>0 d’<d 


ae PL, ENE iB fs (33) 


Monotonicity of M(d) implies that 


UN UW) <K=U A {v(>) <a] 


K>0 d>0 d’<d K>0 v=1 v 


Se 2 


K>0 »=1 nSK Usp K>0 nSK 


and the convergence a.s. follows from P(E, > 0) = 1 for any n. 


(iv) We shall only sketch the rather delicate proof of the last proposition of Theorem 2.7.2. The ideas are due to Anscombe (1952), Geertsema (1970), and Ghosh and Sen (1972).

It follows from Theorem 2.7.1, from assumption (A13) and from the definition of N(d) that $d^2 N(d) = O_p(1)$. We have

\[ P_\beta\{\beta \in I_{N(d)}(y)\} = P_0\{|S_{N(d)}(y)| \le a_{N(d)} K_\alpha\}. \]

By Theorem 1 of Anscombe (1952), the last probability tends to $2\Phi(A^{-1}K_\alpha) - 1 = 1 - \alpha$, provided the following lemma holds.


Lemma 2.7.2 For any positive ¢ and n, there exists a 6 > 0 such that 


P| sup |Su(Yn) — Sa(Yn)| > | <e. (34) 
n’:|n—n'|<6n 

Proof. If follows from assumption (A14) that for any 6’ > 0 there exists a 
An 


6>O0Osuch that sup {1 — 0% 


|n’—n|<dn an 
Let R° = (R°,,..., R°,,) and let 8, = B(RR) be the o-algebra generated by 
R°, n= 1. Then {S,(y,), Bn, 2 = 1} is a martingale (see [A 3.2] for a definition). 


Indeed, 


_— Rey n+1 
E(Snas | %,,) = (%n41 = Xn+1) | ON ee aaa 


n + 2 
x w Roi 
ae = (x; — %_4,) E 3 ia ®,). (35) 


0 
Since # Rost ntt 
n+ 2 
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for any 1 <7 <7, we have from (35) that 


\ 


- iets 
E(Sns1 | Bn) = DL (x; — Zp) =S,, bree. 
i=1 n+1 


Now, we get from the Kolmogorov inequality for martingales (see [A 3.3]) 
that 


P| sup [Sw ape S,| = it 
|n’—n|<6n 

1 n + [dn] n — [dn] 
(dy dee eee 


for n > m and appropriate 6 > 0. 


Bounded length sequential confidence intervals based on ranks were further studied by Hušková (1982). Sequential confidence intervals based on M-estimators were studied by Jurečková and Sen (1981a, b).
Chapter 3


Models with errors-in-variables


In the previous chapters we considered linear and nonlinear regression models in detail. For illustrative purposes we give a simple linear regression model:

\[ y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n. \]

We always assumed the regressors $x_i$ to be observed without errors, which is indeed the case in important practical applications. Think of the models of variance analysis, in which the $x_i$ are so-called 1-0 quantities, depending on the presence of an influence factor. In econometrics, too, we often have the hope that the errors $\varepsilon_i$ in the equations exceed the errors of measurement. But after a more careful examination of many systems in practice this can no longer be assumed and, from the errors-in-equations model, there arises a so-called errors-in-variables model. Special errors-in-variables models are known in the literature by the names functional relation(ship), structural relation(ship), or even functional model (cf. Malinvaud, 1966, p. 378). We can also find some older notation. Furthermore, there are close relationships with linear simultaneous equations.

That is why such models should be used more often in place of regression models, all the more so because the uncritical application of estimators that are otherwise used in regression models may lead to considerable misinterpretations in the described functional models (see Section 3.1.2).
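
The kind of misinterpretation meant here can be illustrated by a small simulation. The sketch below is not taken from Section 3.1.2; it assumes a structural model with normally distributed true regressor and independent normal measurement error, for which the least squares slope computed from the error-prone regressor is known to shrink towards zero by the factor Var(ξ)/(Var(ξ) + Var(δ)). All parameter values, variable names, and the seed are illustrative assumptions.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


rng = random.Random(0)
n = 20_000
alpha, beta = 1.0, 2.0

xi = [rng.gauss(0.0, 1.0) for _ in range(n)]               # true regressor
y = [alpha + beta * v + rng.gauss(0.0, 0.5) for v in xi]   # error in equation
x_obs = [v + rng.gauss(0.0, 1.0) for v in xi]              # error in variable

slope_true_regressor = ols_slope(xi, y)
slope_observed_regressor = ols_slope(x_obs, y)
print(slope_true_regressor, slope_observed_regressor)
```

With equal variances of the true regressor and of the measurement error, the second slope is close to β/2 rather than β, although the same estimator applied to the error-free regressor recovers β.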

One reason why these models are still not very widely used at present may 
surely be that the inference in such models becomes essentially more difficult. 
Firstly, this is due to totally different identification problems. Secondly, the 
least squares estimator is still relatively simple to compute for linear regression 
models, whereas there arise eigenvalue problems for linear errors-in-variables 
models. In more general models we do not obtain explicit formulas. 

Within the last few years the rapid development of computers has made 
possible essential progress in the application of such models. In turn, the 
increasing applications have also stimulated the theoretical investigations. 
Corresponding to this rapid development there have been good introductions
to this problem, which survey mainly linear functional relations
(Madansky, 1959; Kendall and Stuart, 1961, ch. 29; Malinvaud, 1966,
ch. 10; Schönfeld, 1971, ch. 11; Schneeweiss, 1971, ch. 7; Zellner, 1971, ch. 5;
Johnston, 1963; Sprent, 1969, chs. 3, 6; Moran, 1971). Practical problems with
nonlinear errors-in-variables models are extensively treated in Bard (1974).
Since then the treatment of such models has advanced mainly in the fields of
asymptotics, distribution approximation, optimality, Bayes theory, nonlinear
models, and numerics. On
the one hand, this chapter aims at giving a reasonable introduction to the
problem based on the mentioned works. On the other hand, it is necessary to
take into account a series of new results when comprehensively presenting
these known results, and to aim at a relatively closed and comprehensive
presentation of the present state of statistical inference in errors-in-variables
models.

We give a short survey of the contents of the chapter (see Table 3.1.1 and
the table of contents). Simple examples from practical applications are
discussed in Section 3.1.1. In models with errors-in-variables the application of
the ordinary least squares estimator known from the regression model may
lead to considerable errors, which is shown in Section 3.1.2. General
errors-in-variables models are described in Section 3.1.3. A survey concerning
identifiability problems is contained in Section 3.1.4. Because of their fundamental
importance, the proofs of the theorems by Reiersøl (1950) are also given. The
statement of Section 3.1.5, too, is relatively little known: if the structural
parameter is not identifiable in a model with a random experimental design,
then it is not consistently estimable in the corresponding model with nonrandom
experimental design. In Section 3.1.6 we give a survey of the almost
300 works in the field of models with errors-in-variables in order to make
orientation easier.

Section 3.2 contains results on maximum likelihood estimators. Two-dimensional
linear functional relations are considered in Section 3.2.1. The
works by Cox (1976) and Dolby (1976a) made it possible to treat models with
nonrandom experimental design as well as those with normally distributed
random experimental design in a unified approach. Following this we consider
multivariate models with nonrandom experimental design only, with one
exception each in Sections 3.3.2 and 3.5.4. First the connection between the
maximum likelihood and least squares estimators (MLE and LSE, respectively)
is described in Section 3.2.2. In Section 3.2.3 multivariate linear models with
known error covariance are investigated. The MLE with independent errors
of measurement is obtained from an eigenvalue problem. Besides this well-known
result, equivariance and certain uniqueness properties are shown
(according to Höschel, 1978a). For linear models with a covariance that is known
except for a factor, some known results achieved by the coordinate-free
representation are summarized in Section 3.2.4. Section 3.2.5 contains the
fundamental theorem by Anderson (1951a) about the MLE with unknown
covariance for normally distributed independent observation errors, which has
been overlooked for a long time. In the same way as with known covariance,
this MLE may be obtained from certain eigenvectors. For nonlinear models
we describe, in Sections 3.2.6 and 3.2.7, possibilities for simplification which
result from special suppositions about the covariance of the errors.
Identifiability properties of estimators are also given.

The ‘standard asymptotics’ for MLE and WLSE in nonlinear models with 
a fixed experimental design are considered in Section 3.5.1. There result con- 
sistency, asymptotic normality, and optimality as well as a simple derivation 
of some formulae for the asymptotic covariance. But, for a weakly increasing 
experimental design, too, we can obtain the asymptotic normality. In Section 
3.5.5 this is demonstrated by means of a modified Gauss-Newton iteration 
proposed by Fuller and Wolter (1977). The estimation is described earlier in 
Section 3.3.3. The latter is done within the framework of representing alter- 
natives to the MLE in Section 3.3. Section 3.3.1 considers linear two-dimen- 
sional models, especially instrumental-variables estimators for simple examples. 
The relations between known estimators in linear functional relations and 
simultaneous equations are investigated in detail. Starting from Anderson 
(1976) and the following discussion there we describe, on the one hand, the 
relations between MLE and OLSE in functional relations and of limited- 
information MLE and two-stage LSE in simultaneous equations on the other 
hand. 

An approximate comparison of accuracy of these estimators is contained
in the summary of the results by Anderson (1974) in Section 3.5.2. The modified
MLE and two-stage LSE investigated by Fuller (1977) follow in a natural way
in Section 3.3.1. The results of the approximate comparison of accuracy are
given in Section 3.5.3. Up to that point only models with nonrandom
experimental design and independent errors are treated. For linear models in which
errors of measurement and design points are generated by certain time series,
Robinson (1977) could construct an asymptotically normally distributed
estimator, which is described in Section 3.3.2 and for which consistency and some
identifiability problems are investigated in Section 3.5.4.

An independent extensive section is devoted to asymptotic investigations 
in linear errors-in-variables models, where great value is set on a unified re- 
presentation for the various models. Section 3.3.4 provides an introduction 
into the general model taken as a basis here. The object of the investigation is a 
model with nonrandom, nonobservable variables; for this purpose we first 
explain the parametrization and compile results on the maximum likelihood 
estimator. From the relation of models with random and nonrandom non- 
observable variables there results a ‘canonical’ estimation method by means of 
instrumental variables in the latter case, as a formal special case of which we 
get the maximum likelihood estimator. Within this framework we then develop 
the asymptotic theory in Section 3.4. Following the introduction of the asymp- 
totic model in Section 3.4.1, the consistency is considered in Section 3.4.2 and 
an interpretation of the necessary assumption is given. In Section 3.4.3 some 
special cases from the literature are discussed. Then the question of the efficiency
of the MLE (more generally, that of the defined canonical instrumental
variable estimation) is at the centre of our considerations. The starting point 
here is the fact, that in the model the case of infinitely many unknown inci- 
dental parameters (hence of an infinite-dimensional parameter space) arises, in 
which the classical theory of the asymptotic efficiency is not applicable. A 
model-specified solution of the problem is given here by considering a certain 
heuristically motivated class of estimators (asymptotic Q,,-estimators). 
Roughly speaking, this class consists of estimators which are functions of the 
second sample-moments; with this, the most important alternatives to the 
MLE known in the literature are covered. The asymptotic efficiency of the 
MLE can be proved by means of a corresponding optimality theory; this is 
done by the covariance matrix of the normal limit distribution, where nor- 
mality is assumed. Furthermore, a simple efficient estimation results from an 
improvement method (Sections 3.4.4 and 3.4.5). Some well-known results 
about the limit distribution and the comparison are represented as special cases 
or variants. Section 3.4.6 shows some results for the nonnormal case and Sec- 
tion 3.4.7 contains supplementary remarks in connection with the basic pro- 
blem of the model. 

Hypothesis testing and the estimation of confidence regions are treated for
linear models in Sections 3.6 and 3.7. The relatively short presentation is
based on the results of Anderson (1951a).

Finally we describe some possibilities for the numerical computation of the
WLSE. The often applied two-dimensional linear models with different but
known covariances are considered first. In Section 3.8.1 a method of Williamson
(1968) to compute the GLSE is given. In general nonlinear models the
dimension of the unknown parameter that is to be estimated becomes very
large compared to regression models. But the special structure of
errors-in-variables models permits a practicable adaptation of the known
iteration methods even for a relatively large experimental design.
Based on work by O'Neill, Sinclair, and Smith (1969), a Newton-Raphson-type
method is described for two-dimensional polynomial models in Section
3.8.2. In Section 3.8.3 we consider the methods for general models with
errors-in-variables. In particular, we discuss the special structure of the Gauss-Newton
method in these models.


3.1 Models with errors-in-variables 


3.1.1 Functional and structural relations: an introduction


When we investigate a concrete system we first establish a deterministic model
as a set of structures (cf. Bunke and Bunke, 1986, Section 1.3). In practical
problems such models are mostly described as functional models. They reflect
the knowledge or the assumptions about the mathematical relation between
the investigated variables, in the first stage mostly in the form of equations.
In practice, these variables can only be observed with an evidently random
error. This leads to the corresponding observation model.


Example 3.1.1 The relation between mass $m$, volume $v$, and density $d$ is
$m = dv$. We want to ‘determine’, or more precisely to estimate, the density
$d$ on the basis of several measurements of $m$ and $v$. This example was discussed
in detail in Madansky (1959).


Example 3.1.2 For uniformly accelerated motion we have, for the starting
velocity $v_0$, the time $t$, the acceleration $b$, and the covered distance $s$,

$$s = v_0 t + b t^2/2.$$

For example, we are interested in a confidence region for $v_0$ and $b$ if
measurements of the two other quantities are available.


Example 3.1.3 With the physical pendulum we have, for the duration of the
oscillation $T$, mass $m$, the moment of inertia $M$, and the distance $a$ between
the axis of rotation of the pendulum and the centre of gravity,

$$T^2 = 4\pi^2 M/(mga),$$

where $g$ is the acceleration of gravity.
Already from the simple linear functional relations, important fundamental
questions can be discussed. Let the basic deterministic model for the variables
$\xi$, $\eta$ be linear:

$$\eta = \alpha + \beta\xi. \tag{1}$$

Now we observe $n$ points $\mu_i = [\xi_i, \eta_i]$, $i = 1, \ldots, n$, satisfying this model. But
they can only be observed with a random observation error $\zeta_i$. Thus,

$$z_i = \mu_i + \zeta_i \tag{2}$$

is observed, or written componentwise, for $z_i := [x_i, y_i]$ and $\zeta_i := [\delta_i, \varepsilon_i]$,

$$x_i = \xi_i + \delta_i, \qquad y_i = \eta_i + \varepsilon_i. \tag{3}$$

In case $\delta_i = 0$, we obtain a bivariate linear regression model. The vector
$\mu := \mu_{(n)} = (\mu_i)_{i=1,\ldots,n}$ of the observed test points is consequently ‘a sort of’
experimental design. But this design is unknown and, contrary to regression
models, it cannot be planned directly. The errors are often assumed to
be independent:


DeEGi=1.e2, x € Mz. 
We can also simply write repeated observations of the same experimental design:

$$z_{ik} = \mu_i + \zeta_{ik}, \qquad k = 1, \ldots, m_i, \tag{4}$$

where $m_i$ denotes the number of replications at the $i$th design point.


Example 3.1.4 (cf. Barnett, 1970) In a medical experiment the relation
between the protein level $\eta$ of the urine and the applied drug dose $\xi$ was
investigated and repeated measurements were taken. The relation between $\xi$
and $\eta$ is supposed to be linear. Now the $\xi_i$ and hence the $\mu_i$ can very well be
realizations of random quantities.


Example 3.1.5 Possibly the dose $\xi_i$ obtained in Example 3.1.4 has to be
modelled as a random quantity with a distribution of its own. Random variations
of the dosage arise, e.g., from the filling of the syringe. We proceed in the same
way if we obtain the specific dosages from a certain chemical process where an
exact measurement of the dose is not possible.


Now, if we make special suppositions about the distribution of $\mu_{(n)}$, i.e.
the joint distribution of the $\mu_i$, $i = 1, \ldots, n$, it is sometimes even possible
to investigate the distribution parameters of $\mu_{(n)}$.

A simple model of this kind is

$$\eta_i = \alpha + \beta\xi_i, \qquad z_i = \mu_i + \zeta_i, \qquad \mu_i = [\xi_i, \eta_i], \qquad i = 1, \ldots, n, \tag{5}$$

with the distribution model

$$\xi_i \sim P_\vartheta \in \{P_\vartheta \mid \vartheta \in \Theta\}, \qquad \zeta_i \sim \{N(0, \Sigma) \mid \Sigma \in \Gamma\},$$

where the $\xi_i$ and the $\zeta_i$ are supposed to be stochastically independent.

This model has been thoroughly discussed in the literature, where the term
‘linear structural relation’ was adopted. It was of particular importance because
some fundamental difficulties with identifiability arise already in this simple
model and had to be discussed in detail. On the other hand, only this simple
model had been accessible to numerical computations for a long time. Namely,
only for the case that the $P_\vartheta$ are normal distributions can we secure the
identifiability of $\alpha$ and $\beta$ under certain restrictions on $\Gamma$ and obtain, besides
elementary computable estimates for $\alpha$ and $\beta$, those for $\Sigma$ and $\vartheta$, too. But
other models with a random experimental design and the usual
normality assumptions could also be investigated, at least asymptotically, e.g.
the following model with replicated observations (cf. Cox, 1976; Dolby,
1976a; Patefield, 1977a; Brown, 1978a; Chan and Mak, 1979a, b):


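$$\eta_{ij} = \alpha + \beta\xi_{ij}, \qquad \xi_{ij} \sim \{N(\vartheta_i, \sigma_\xi) \mid \vartheta_i \in \mathbb{R},\ \sigma_\xi \in \mathbb{R}^{\ge}\}, \tag{6}$$

$$z_{ij} = \mu_{ij} + \zeta_{ij}, \qquad \zeta_{ij} \sim \{N(0, \Sigma) \mid \Sigma = \operatorname{diag}(\sigma_1, \sigma_2),\ \sigma_i \ge 0\}, \tag{7}$$

$$i = 1, \ldots, n, \qquad j = 1, \ldots, m_i.$$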
With further restrictions on the parameter region this model also yields the
other models mentioned so far. But it is obvious that the same functional
model

$$\eta_i = \alpha + \beta\xi_i, \qquad z_i = \mu_i + \zeta_i \tag{8}$$

may be completed by different distribution models.

From the supposition of nonrandom $\mu_i$ to the different distribution
assumptions for $\mu_{(n)}$, a great number of practically important linear models can be
imagined. This is also the case for simple bivariate nonlinear models. Let us
take for instance the simple quadratic model

$$\eta_i = \alpha + \beta_1\xi_i + \beta_2\xi_i^2, \qquad z_i = \mu_i + \zeta_i \tag{9}$$

without distribution assumptions.

Finally, all essential differences between such models result from the question
whether, with an increasing number of observations, the number of the points
$\mu_i$, $i = 1, \ldots, n$, which are usually called incidental parameters, also
increases. For all these models we use the term ‘functional relation’ and, if
necessary, we distinguish between those with a random or a nonrandom
experimental design.


3.1.2 Comparison with regression models 


The regression model is a special case of a linear functional relation if the 
independent variables may be observed without errors. In the following we will 
demonstrate — by means of the bivariate linear functional relation (LIFU) 
from Section 3.1.1 — the following facts: 


1. In a LIFU we do not have to distinguish ‘regressors’ and ‘regressands’, 
nor ‘exogenous’ and ‘endogenous’ variables. 

2. A LIFU may be formally written as a linear regression model with stocha- 
stic regressors, but in such a ‘regression model’ the usual supposition on the 
independence of equation errors and regressors is not fulfilled. 

3. This is the reason why the uncritically applied least squares method fails 
here and provides inconsistent estimations. 


In (1) the LIFU was given in the form $\eta_i = \alpha + \beta\xi_i$. Here $\xi_i$ need not be an
exogenous variable, as Examples 3.1.1-3.1.3 show. What is essential in these
examples is that the vectors $[\xi_i, \eta_i]$ lie on a straight line in the plane. But the special
parametrization (1) does not include the straight line $0 = \xi_i$, unless we
admit this special case in the form $\beta = \infty$. Thus, in
many cases we will not have prescribed exogenous variables in an implicit
representation of the form

$$0 = 1 + \beta_0\xi_i + \beta_1\eta_i \tag{10}$$

and we will not be able to mark either of the two variables as being exogenous
or independent.
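
The symmetric role of the two variables in an implicit representation can be
illustrated by a small sketch. It is not an estimator taken from this section:
it fits a line $a\xi + b\eta + c = 0$ by minimizing orthogonal distances (a choice
that is reasonable, for instance, if both error variances are assumed equal), and
it uses the normalization $a^2 + b^2 = 1$ rather than the normalization of the
implicit form above. The function name `fit_line_implicit` is hypothetical.

```python
import math

def fit_line_implicit(points):
    """Return (a, b, c) with a*xi + b*eta + c = 0 and a**2 + b**2 == 1.

    The line is the orthogonal-distance (total least squares) fit, so
    neither coordinate is singled out as the 'regressor'.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    # Angle of the principal axis of the scatter matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # Unit normal vector of the fitted line.
    a, b = -math.sin(theta), math.cos(theta)
    c = -(a * mx + b * my)
    return a, b, c

# A vertical line (xi = 3) is representable, although eta = alpha + beta*xi is not.
vertical = fit_line_implicit([(3.0, 0.0), (3.0, 1.0), (3.0, 2.0)])
# Points on eta = 1 + 2*xi give slope -a/b = 2.
sloped = fit_line_implicit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

In this parametrization the vertical line corresponds to $b = 0$, so no limiting
case $\beta = \infty$ is needed.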
Further, from (1) and (2) we obtain the representation

$$y_i = \alpha + \beta x_i + (\varepsilon_i - \beta\delta_i). \tag{11}$$

Since the observations $z_i = [x_i, y_i]$ are known, (11) gives a
regression model in which the ‘regressands’ $y_i$ and the ‘regressors’ $x_i$ were
observed. The equation error $\varepsilon_i - \beta\delta_i$ has a vanishing expectation as in the
regression model. But

$$\operatorname{Cov}(x_i, \varepsilon_i - \beta\delta_i) = \mathrm{E}\bigl((\xi_i + \delta_i)(\varepsilon_i - \beta\delta_i)\bigr) = -\beta\sigma_\delta \tag{12}$$

holds, where $\sigma_\delta := \mathrm{D}(\delta_i)$ denotes the variance of $\delta_i$ and the errors
$\delta_i$, $\varepsilon_i$ are taken to be uncorrelated with zero mean; this statement obviously
is true for the LIFU (6), too. Hence, the
covariance between the regressors and equation errors vanishes only for $\sigma_\delta = 0$
if we do not take into account the special case $\beta = 0$.

We know that in a regression model with random regressors (cf. Bunke and
Bunke, 1986, example 2.1.1) the best linear unbiased estimator results for
each realization $x_{(n)}$ of the random regressors from the OLSE with the fixed
regressors $x_{(n)}$ (see e.g. Rao, 1973, 4.a.11). Here we assume the independence of the
regressors $x_i$ and the errors $\varepsilon_i$, of course. The obtained estimator is consistent.
If the regressors and errors are correlated as in the present case, this is no
longer true. In this case the OLSE is

$$\hat\beta := \frac{\sum_i (x_i - \bar{x}_\cdot)(y_i - \bar{y}_\cdot)}{\sum_i (x_i - \bar{x}_\cdot)^2}, \qquad \hat\alpha := \bar{y}_\cdot - \hat\beta\,\bar{x}_\cdot. \tag{13}$$

The statement of this example is not diminished if we assume the errors $\delta_i$
and $\varepsilon_i$, respectively, to be independent and identically distributed and if $\delta_i$ is
independent of $\varepsilon_i$. We write $\sigma_\delta := \mathrm{D}(\delta_i)$, $\sigma_\varepsilon := \mathrm{D}(\varepsilon_i)$.
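Furthermore, let

$$\lim_{n\to\infty} \bar{\xi}_\cdot =: \bar{\xi} < \infty, \qquad \lim_{n\to\infty} \sum_i \xi_i\delta_i/n = \lim_{n\to\infty} \sum_i \xi_i\varepsilon_i/n = 0, \qquad \lim_{n\to\infty} \sum_i (\xi_i - \bar{\xi}_\cdot)^2/n =: \sigma_\xi < \infty \tag{14}$$

hold for the experimental design $\mu_{(n)}$. By a simple application of the law of
large numbers we show that

$$\hat\beta \to \lim_{n\to\infty} \frac{\sum_i (\xi_i + \delta_i - \bar{\xi}_\cdot - \bar{\delta}_\cdot)(\beta\xi_i + \varepsilon_i - \beta\bar{\xi}_\cdot - \bar{\varepsilon}_\cdot)}{\sum_i (\xi_i + \delta_i - \bar{\xi}_\cdot - \bar{\delta}_\cdot)^2} = \frac{\beta\sigma_\xi}{\sigma_\xi + \sigma_\delta}, \tag{15}$$

$$\hat\alpha \to \alpha + \frac{\beta\,\bar{\xi}\,\sigma_\delta}{\sigma_\xi + \sigma_\delta}. \tag{16}$$

Thus, the OLSE is inconsistent for $\sigma_\delta \neq 0$.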
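
The attenuation of the OLSE slope can be checked numerically. The following
sketch simulates a bivariate LIFU with a nonrandom design and compares the OLSE
slope with the limit $\beta\sigma_\xi/(\sigma_\xi + \sigma_\delta)$. The equally spaced
design on $[0, 10)$, the normal errors, and the function names are assumptions
made only for this illustration; here $\sigma_\xi$, $\sigma_\delta$, $\sigma_\varepsilon$
denote variances, as in the text.

```python
import random

def simulate_olse_slope(n, alpha, beta, var_delta, var_eps, seed=0):
    """OLSE slope of y on the error-prone x for one simulated sample."""
    rng = random.Random(seed)
    # Nonrandom design points xi_i, equally spaced on [0, 10) (assumption).
    xi = [10.0 * i / n for i in range(n)]
    # Observations x_i = xi_i + delta_i and y_i = alpha + beta*xi_i + eps_i.
    x = [v + rng.gauss(0.0, var_delta ** 0.5) for v in xi]
    y = [alpha + beta * v + rng.gauss(0.0, var_eps ** 0.5) for v in xi]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx

def limit_slope(beta, var_xi, var_delta):
    """Large-sample limit of the OLSE slope under errors in x."""
    return beta * var_xi / (var_xi + var_delta)

# For the equally spaced design on [0, 10) the design variance tends to 100/12.
slope = simulate_olse_slope(20000, alpha=1.0, beta=2.0,
                            var_delta=4.0, var_eps=1.0, seed=1)
expected = limit_slope(2.0, 100.0 / 12.0, 4.0)
```

With these settings the simulated slope stays close to the attenuated limit and
well below the true value $\beta = 2$, however large $n$ is chosen.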
Under more general assumptions the inconsistency can be proved for
multivariate LIFUs by means of martingale theory (cf. Robinson, 1977, lemma 1;
see also Section 3.5.4 and [A 3.2]). Some practical investigations from
econometrics indicate that the detection of this inconsistency is of fundamental
importance for the correct interpretation of data (see Malinvaud, 1966, p. 380).
Having investigated the three points of comparison mentioned at the beginning,
we now compare the regression models and LIFUs with respect to
prediction. In doing this we follow the thoughts of Malinvaud (1966, pp. 382-383).
For this purpose it suffices to consider the case $\alpha = 0$ in simple LIFUs.

The problem of prediction consists in determining a value $y^*_{n+1}(z_{(n)})$ as a
prediction for $\eta_{n+1}$ or $y_{n+1}$, depending on the observation $z_{(n)}$. Here let $z_{(n)}$ and
$\xi_{n+1}$ or $x_{n+1}$ be given. Usually $\mathrm{E}\bigl((y^*_{n+1}(z_{(n)}) - \eta_{n+1})^2\bigr)$ and
$\mathrm{E}\bigl((y^*_{n+1}(z_{(n)}) - y_{n+1})^2\bigr)$, respectively, are minimized as a risk.


1. Functional models of the described kind are often used to predict the
corresponding value of $\eta_{n+1}$ when the ‘true’ value $\xi_{n+1}$ is known. For instance,
we want to predict the mass $\eta$ on the basis of the estimated density $\hat\beta$ when
the volume $\xi$ takes a certain value. Or we would like to predict the increase
of the total social consumption that is to be expected when incomes reach a
certain amount. Let $\hat\beta(z_{(n)})$ be an arbitrary estimator of $\beta$. Then
$y^*_{n+1} = \hat\beta\xi_{n+1}$ is the prediction and the risk is

$$\mathrm{E}(y^*_{n+1} - \eta_{n+1})^2 = \xi_{n+1}^2\,\mathrm{E}(\hat\beta - \beta)^2. \tag{17}$$

To minimize the risk, the mean square deviation of $\hat\beta$ from $\beta$ has to be
minimized. Because of the inconsistency of the regression OLSE we may obtain
an improvement, at least for large numbers of observations, by using a consistent
estimator.

2. In many practical cases LIFUs are used to predict $y_{n+1}$ when knowing the
observed value $x_{n+1}$ and not the value $\xi_{n+1}$ that was assumed to be known in 1. For
instance, we want to predict the weight of a workpiece which is not contained
in the sample on the basis of the measured volume, or we want to predict the
consumption of a certain product for a household based on the measured
income.
Now, let $z_{n+1}$ have a distribution that is determined by $\mu_{n+1}$ and $\zeta_{n+1}$ and
that is the same as that of $z_i$, $i = 1, \ldots, n$. If the regression function, i.e. the
conditional expectation, of $y_{n+1}$ on $x_{n+1}$ is linear, then the regression OLSE
indeed provides a satisfactory estimate of the regression slope, which is suitable to predict
$y_{n+1}$ for observed $x_{n+1}$. However, for the LIFU the linearity of the relation
$\eta = \beta\xi$ does not necessarily carry over to the regression function of $y_{n+1}$ on
$x_{n+1}$. This is the case only under special assumptions (cf. Lindley, 1947; Kendall
and Stuart, 1961, ch. 29, pp. 56-59). For this general case (and also if $\mu_{n+1}$
does not have the same distribution as the $\mu_i$, $i = 1, \ldots, n$) it is natural to take,
as in 1, $y^*_{n+1} = \hat\beta x_{n+1}$ as the prediction because other prediction functions are
more complicated. The error of the prediction is

$$\Delta_{n+1} := y^*_{n+1} - y_{n+1} = x_{n+1}(\hat\beta - \beta) + (\delta_{n+1}\beta - \varepsilon_{n+1}). \tag{18}$$

With given $x_{n+1}$ it follows for the risk that

$$\mathrm{E}(\Delta_{n+1}^2) = x_{n+1}^2\,\mathrm{E}(\hat\beta - \beta)^2 + 2\,\mathrm{E}\bigl(x_{n+1}(\hat\beta - \beta)(\delta_{n+1}\beta - \varepsilon_{n+1})\bigr) + \mathrm{E}\bigl((\delta_{n+1}\beta - \varepsilon_{n+1})^2\bigr). \tag{19}$$

With the stochastic independence of $\zeta_{n+1}$ and $z_{(n)}$ the second term vanishes
and

$$\mathrm{E}(\Delta_{n+1}^2) = x_{n+1}^2\,\mathrm{E}(\hat\beta - \beta)^2 + \beta^2\sigma_\delta + \sigma_\varepsilon \tag{20}$$

results.

Hence we can obtain an improvement compared with the inconsistent
regression OLSE in this case, too, by using a consistent estimator, at least for
large samples. Thus, the uncritical application of regression methods causes
misinterpretations also in this second problem of prediction.
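
As a numerical illustration of the second prediction problem, the following
sketch evaluates the risk
$x_{n+1}^2\,\mathrm{E}(\hat\beta - \beta)^2 + \beta^2\sigma_\delta + \sigma_\varepsilon$
for large samples. It assumes, for illustration only, that the mean square error
of the OLSE tends to its squared asymptotic bias and that the mean square error
of a consistent estimator tends to zero; the parameter values and function names
are hypothetical.

```python
def olse_limit_slope(beta, var_xi, var_delta):
    """Large-sample limit of the OLSE slope in the bivariate LIFU."""
    return beta * var_xi / (var_xi + var_delta)

def prediction_risk(x_next, mse_beta, beta, var_delta, var_eps):
    """Risk of predicting y_{n+1} by beta_hat * x_{n+1} for a given x_{n+1}."""
    return x_next ** 2 * mse_beta + beta ** 2 * var_delta + var_eps

beta, var_xi, var_delta, var_eps, x_next = 2.0, 8.0, 4.0, 1.0, 3.0

# Asymptotic bias of the OLSE slope (negative: attenuation towards zero).
bias = olse_limit_slope(beta, var_xi, var_delta) - beta

# Large-sample risks: squared bias for the OLSE, zero MSE for a consistent estimator.
risk_olse = prediction_risk(x_next, bias ** 2, beta, var_delta, var_eps)
risk_consistent = prediction_risk(x_next, 0.0, beta, var_delta, var_eps)
```

The term $\beta^2\sigma_\delta + \sigma_\varepsilon$ is common to both risks; the gain from a
consistent estimator comes only from the first term and grows with $x_{n+1}^2$.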


3.1.3 Models with errors-in-variables 


3.1.3.1 The fundamental model 


In Bunke and Bunke (1986, Sections 1.2 and 1.3), basic definitions for the
investigation of statistical problems of model formation were given.
The definitions, directed to the application in regression models, contained
the regressands in explicit dependence from the very beginning. Typical
was Example 1.2.5 with the system equation

$$y_i = D(b, x_i, \varepsilon_i, u_i).$$

There $y_i$ was the output vector, $x_i$ the input vector, $u_i$ and $\varepsilon_i$ were the control
variables and error variables, respectively, and $b$ was the system parameter.
There are already simple deterministic models in which such an explicit
description of the system is no longer suitable.


Example 3.1.6 The movement of a celestial body round the sun in a plane
is described by a quadratic equation, i.e. for linear local coordinates $v^{(1)}$ and
$v^{(2)}$ we have

$$0 = \pi^{(0)} + \pi^{(1)}v^{(1)} + \pi^{(2)}v^{(2)} + \pi^{(3)}(v^{(1)})^2 + \pi^{(4)}v^{(1)}v^{(2)} + \pi^{(5)}(v^{(2)})^2 = s_0(v, \pi)$$

with $\pi := [\pi^{(0)}, \ldots, \pi^{(5)}]$ and $v := [v^{(1)}, v^{(2)}]$.

The solution of such an implicit system equation with respect to one variable,
regarded as being independent or exogenous, will not be possible in
general. Additionally, singling out one quantity as exogenous often does
not make any sense. For instance, the two local coordinates are totally
equivalent from the practical point of view.


Definition 3.1.1 (Deterministic functional model) Let $v \in \mathbb{R}^{d_z}$ be a vector of
system-describing variables. The model is given by a parameter-dependent equation

$$0 = s_0(v, \pi), \qquad s_0 : \mathbb{R}^{d_z + d_\pi} \to \mathbb{R}^{c}, \qquad c < d_z. \tag{21}$$

$\pi \in \Pi \subseteq \mathbb{R}^{d_\pi}$ is called the system or structure parameter; $s_0$ is said to be
the system or state function. The equation (21) is called the system or state equation. As state
manifold or structure for the parameter $\pi$ we denote the set

$$St_\pi := \{v \mid 0 = s_0(v, \pi)\}. \tag{22}$$

The set of state manifolds admitted in the model,

$$St_\Pi := (St_\pi)_{\pi \in \Pi}, \tag{23}$$

is called a structural bundle.

The set $\Pi$ of system parameters admitted in the model may possibly be
described by an implicit equation:

$$\Pi = \{\pi \in \mathbb{R}^{d_\pi} \mid 0 = p(\pi)\}, \qquad p : \mathbb{R}^{d_\pi} \to \mathbb{R}^{q}. \tag{24}$$

Continuing Example 3.1.6, the following example shall demonstrate the
practical importance of nontrivial restrictions of the kind $0 = p(\pi)$.


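Example 3.1.7 Among all orbits of celestial bodies let us only consider
parabolas. Writing $\pi^{(3)}$, $\pi^{(4)}$, $\pi^{(5)}$ for the coefficients of $(v^{(1)})^2$,
$v^{(1)}v^{(2)}$, $(v^{(2)})^2$ in the quadratic equation of Example 3.1.6, these are
generated by the following restriction of the parameters:

$$0 = \det\begin{pmatrix} \pi^{(3)} & \pi^{(4)}/2 \\ \pi^{(4)}/2 & \pi^{(5)} \end{pmatrix}.$$

(In fact, pairs of parallel or imaginary straight lines are still included here.)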


Already with simple examples of applications, not only implicit equations
may be suitable to describe the model, but more general sets $\Pi$ and $St_\pi$ may
also occur.


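Example 3.1.8 Again we consider the Kepler model for the movement of the
planets around the sun. The orbits of the planets are ellipses where one focus
lies in the sun. The related model is covered by a certain subset of quadrics,
let us say of the form

$$0 = s_0(v, \pi), \qquad \pi \in \Pi,$$

where $\Pi \subset \mathbb{R}^6$ describes the set of all ellipses from the Kepler model.
However, ellipses are not described by equations of the form $p(\pi) = 0$,
but by inequalities. With $\pi^{(0)}, \ldots, \pi^{(5)}$ denoting the coefficients of the
constant, $v^{(1)}$, $v^{(2)}$, $(v^{(1)})^2$, $v^{(1)}v^{(2)}$ and $(v^{(2)})^2$ terms, these are

$$\Delta_2 > 0, \qquad \Delta_1\Delta_3 < 0,$$

$$\Delta_1 = \det\begin{pmatrix} \pi^{(3)} & \pi^{(4)}/2 & \pi^{(1)}/2 \\ \pi^{(4)}/2 & \pi^{(5)} & \pi^{(2)}/2 \\ \pi^{(1)}/2 & \pi^{(2)}/2 & \pi^{(0)} \end{pmatrix}, \qquad \Delta_2 = \det\begin{pmatrix} \pi^{(3)} & \pi^{(4)}/2 \\ \pi^{(4)}/2 & \pi^{(5)} \end{pmatrix}, \qquad \Delta_3 = \pi^{(3)} + \pi^{(5)}.$$

Further restrictions result from the demand that the sun lies in one focus.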
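
The restrictions on the structural parameter in the last two examples can be
made concrete by a small sketch that classifies a quadric in the plane from its
coefficients, using the standard determinant criteria for conics. The coefficient
order (constant, $v^{(1)}$, $v^{(2)}$, $(v^{(1)})^2$, $v^{(1)}v^{(2)}$, $(v^{(2)})^2$), the
tolerance `tol`, and the function name `classify_conic` are assumptions of this
illustration.

```python
def classify_conic(p0, p1, p2, p3, p4, p5, tol=1e-12):
    """Classify 0 = p0 + p1*v1 + p2*v2 + p3*v1**2 + p4*v1*v2 + p5*v2**2."""
    # Determinant of the quadratic part [[p3, p4/2], [p4/2, p5]].
    delta2 = p3 * p5 - p4 * p4 / 4.0
    # Determinant of the symmetric 3x3 matrix
    # [[p3, p4/2, p1/2], [p4/2, p5, p2/2], [p1/2, p2/2, p0]].
    delta1 = (p3 * (p5 * p0 - p2 * p2 / 4.0)
              - (p4 / 2.0) * (p4 * p0 / 2.0 - p1 * p2 / 4.0)
              + (p1 / 2.0) * (p4 * p2 / 4.0 - p5 * p1 / 2.0))
    # Trace of the quadratic part.
    delta3 = p3 + p5
    if abs(delta1) <= tol:
        return "degenerate"            # e.g. pairs of straight lines
    if abs(delta2) <= tol:
        return "parabola"              # equality restriction on the parameter
    if delta2 > 0:
        # Inequality restrictions: a real ellipse needs delta1 * delta3 < 0.
        return "ellipse" if delta1 * delta3 < 0 else "imaginary ellipse"
    return "hyperbola"

circle = classify_conic(-1.0, 0.0, 0.0, 1.0, 0.0, 1.0)     # v1^2 + v2^2 = 1
parabola = classify_conic(0.0, 0.0, -1.0, 1.0, 0.0, 0.0)   # v2 = v1^2
```

The parabola case is an equality restriction of the form $0 = p(\pi)$, whereas the
ellipse case is described by inequalities, which is why a more general parameter
set $\Pi$ is needed.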


In the general definition of a deterministic functional model we abstract 
from the parametric explicit description from Definition 3.1.1 (cf. Figure 3.1.1). 
The use of the general notions permits a simple and illustrative description of 
estimation methods like the least squares estimator. 


Definition 3.1.2 As a deterministic functional model for a cause-effect relation
between the components of a vector $v \in \mathbb{R}^{d_z}$ of system-describing variables we denote
an arbitrary set $St_\Pi$ of structures

$$St_\Pi = (St_\pi)_{\pi \in \Pi}, \qquad St_\pi \subseteq \mathbb{R}^{d_z}, \qquad \Pi \subseteq \mathbb{R}^{d_\pi}.$$

The index $\pi \in \Pi$ is called the structural parameter.


Fig. 3.1.1.


After having defined deterministic functional models and having thereby
shown that implicit and more general models may also be of practical
importance, we want to introduce stochastic functional models. They result from
deterministic models if the related observational model includes random
variables coming from random errors of the observations or from the approximation
of the unknown structure. The vectors $v$ can also be distributed randomly on the
structure $St_\pi$.

Functional relations are the first class of models we consider. They include
$n$ fixed points $\mu_i$ on the structure $St_\pi$: $\mu_i \in St_\pi$, $i = 1, \ldots, n$, the so-called
experimental design. But the observation $z_i$ of $\mu_i$ can only be obtained with an
additive random error $\zeta_i$.


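Definition 3.1.3 A functional relation with nonrandom experimental design is
given by

$$\begin{aligned} z_i &= \mu_i + \zeta_i, & i &= 1, \ldots, n, \quad \mu_i \in \mathbb{R}^{d_z}, \\ 0 &= s_0(\mu_i, \pi), & s_0(\cdot, \pi) &: \mathbb{R}^{d_z} \to \mathbb{R}^{c}, \\ 0 &= p(\pi), & p(\cdot) &: \mathbb{R}^{d_\pi} \to \mathbb{R}^{q}. \end{aligned} \tag{25}$$

The $\mu_i$ are called incidental parameters or design points. The $\zeta_i$ are realizations
of random measurement errors with a distribution model

$$\zeta_{(n)} = (\zeta_i)_{i=1,\ldots,n} \sim P_\gamma \in \{P_\gamma \mid \gamma \in \Gamma\}.$$

In special applications we do not always have a ‘direct’ observation
error $\zeta_i$ with $z_i = \mu_i + \zeta_i$, as the following example will illustrate.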


Example 3.1.9 As the orbits of the planets are not directly observable on the
apparent celestial globe, we obtain the following situation (cf. Figure 3.1.2).
In the state space the cause-effect relation is described by a deterministic
functional model

$$St_\Pi = \{St_\pi\}_{\pi \in \Pi}.$$

Furthermore we have a transformation $A$ of the ‘variable space’ into the
‘observation space’, which is the apparent celestial globe in this case. For
simplicity we consider $A$ to be parameter- and time-independent. Then the
points $\mu_i$, $i = 1, 2, \ldots, n$, on the fixed structure $St_\pi$ can no longer be observed,
but only the transformed points $A(\mu_i)$ with

$$z_i = A(\mu_i) + \zeta_i, \qquad i = 1, \ldots, n.$$

This example also shows the sequential character of practical modelling
very well. In the beginning Kepler was not able to bring the exact observations
of planets by Brahe into good correspondence with the Copernican model
$St_\Pi$ = {eccentric rotary motion}. In 1609, after he had adjusted the
observations to (estimated) orbits $St_{\hat\pi}$, he found the laws named after him. With
this the Copernican heliocentric system had got rid of its shortcomings. But
with increased accuracy of observations this model, too, does not suffice. The
long-known rotation of the perihelion of Mercury could be explained sufficiently
exactly only by a model from Einstein’s general theory of relativity.


Fig. 3.1.2. 


A slightly more compact way of writing implicit models is

$$z_{(n)} = \mu_{(n)} + \zeta_{(n)}, \qquad \mu_{(n)} \in \mathbb{R}^{n d_z},$$

$$0 = s_{(n)}(\mu_{(n)}, \pi), \qquad s_{(n)} = [s_i]_{i=1,\ldots,n},$$

with a distribution model for the observation error:

$$\zeta_{(n)} \sim P_\gamma \in \{P_\gamma \mid \gamma \in \Gamma\}.$$

As in this case the regressors are also nonstochastic, these models are called
models with nonrandom experimental design or models with nonrandom
incidental parameters.

In the literature restrictions have been imposed on the mean and covariance of
the distributions. In most cases $\mathrm{E}\,\zeta_{(n)} = 0$ is assumed, and for various
investigations the covariance is also assumed to be known, as for example in Britt
and Luecke (1973).
We will mostly treat explicit models in the following. Many implicit models
have to be reduced to explicit ones for their numerical treatment. The following
unified notation shall be used for explicit models.


Definition 3.1.4 An explicit functional relation is given by the equations

$$\eta_i = r_0(\xi_i, \pi), \qquad \mu_i = [\xi_i, \eta_i],$$

$$x_i = \xi_i + \delta_i, \qquad y_i = \eta_i + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $z_i = [x_i, y_i] \in \mathbb{R}^{d_z}$ are realizations of the random observations. The
distribution of $z_i$ is determined by the distribution of the observation error
$\zeta_i = [\delta_i, \varepsilon_i]$. We have $d_\xi = d_x$. The $\xi_i$ are called incidental parameters.


Mostly we assume the $\zeta_i$ to be independent and identically distributed:

$$\zeta_i \sim P_\gamma \in \{P_\gamma \mid \gamma \in \Gamma\}. \tag{27}$$


3.1.3.2 Linear functional relations 


Definition 3.1.5 A (homogeneous multivariate) linear functional relation with
nonrandom experimental design (LIFU*) is given by

$$\begin{aligned} \eta_i &= B\xi_i, & B &\in \mathbb{M}_{d_y \times d_x}, \\ x_i &= \xi_i + \delta_i, & \dim \xi_i &= d_z - c =: d_x, \\ y_i &= \eta_i + \varepsilon_i, & \dim y_i &= c =: d_y, \end{aligned} \tag{28}$$

and a distribution model for the observation error $\zeta_{(n)} \sim P_\gamma$, $\gamma \in \Gamma$. The vectors
$z_i := [x_i, y_i]$ are the random observations and the $\mu_i := [\xi_i, \eta_i]$ yield the
experimental design.


In many practical applications the $\zeta_i$ are supposed to be stochastically
independent. But there is an increasing number of works that suppose
dependences between the $\zeta_i$. We often use the representation of the parameter as a
column vector, which follows Definition 3.1.4,

$$\pi := [b_1, \ldots, b_{d_y}], \qquad b_i \in \mathbb{R}^{d_x},$$

where the $b_i$ are the row vectors of $B$.

But $\eta = B\xi$ does not include all linear $(d_z - c)$-dimensional subspaces. For
$d_z = 2$ we have $\eta = \beta\xi$. Consequently we can obtain the $\eta$-axis only as a
limiting case $\beta = \infty$. Fortunately, by this we only exclude a null set of linear spaces
(cf. Remark 3.2.4 in Section 3.2.5).

It is natural and more suitable for theoretical purposes to choose the set
$\mathfrak{L}_r$ of all $r$-dimensional subspaces of $\mathbb{R}^{d_z}$ as the structural bundle of the
model. Correspondingly let $\mathfrak{L}_{\le d} := \bigcup_{r \le d} \mathfrak{L}_r$. Instead of (28) we
obtain as a ‘linear model’ more generally

epee Cc La, —dy or LE ey =a, (29) 
or 

Mn EL", LEX -a,- (30) 
Equivalently 

HRA EC Sil Oe SiGe aes (31) 


Using generating matrices for the subspaces we obtain further possibilities 
of representations by 


me RL), LE Mexia (32) 
or Max (d-e) » d=d,, c=d,, 


or 
Mi = Lj, LeMay Pee Ree (33) 


Here $\mathcal{L} \in \mathcal{L}_r$ is uniquely determined, but neither $L$ nor the $\nu_i$ are.
Equivalent to (29) we get 


1 Oe PNY EGO Knees r=—d, or r2d,(=¢). (34) 
A further possibility is provided by 
(Rj, ---) fa)) = Fy — F,- 


An extension of the model is obtained by 


w= Mn C4", LER saa, (35) 
2A Bes, A Oni cara 


with known matrix A (cf. Example 3.1.9).
This model includes some important special cases. For instance, LIFU+ with
repeated observations of a fixed experimental design results from


A=D®ii,, D = Diag (1n,)i, m= m, 
where the ‘design point’ $\mu_i$ is observed $m_i$ times. A further special case is
A=U'®Iu,, Cee Mice s sn: 


Analogous to the usual regression models, linear errors-in-variables models
with r < n are called singular. For singular LIFU+ we have to carry out
identifiability considerations analogous to those in Bunke and Bunke (1986, 2.2.1)
(cf. Section 3.1.5).


(35) is closely connected with multivariate linear regression models with
restrictions on the regression parameter. Namely, if we write


M = En) > Z => 2m)» & = S(m) > ft = A(L+), (36) 
then we get the model 


Z=MU+E, ZEMuaxm ete. 
(37) 
j OE eI 6a oh Oe er 


Now, if we understand M € Ni,., as a matrix of regression parameters, then 
the mentioned connection results. Notice that here the number of columns 
of the parameter matrix M € M.,., may ‘increase’ together with m. 


3.1.3.3 Linear functional relations with fixed experimental design 
and with linear regression part 


Extensions of this model are closely related to linear simultaneous equations
(cf. (40)). By LIFU+ with a linear regression part we denote the model

Z=M,U+M.V+$=:MW+S, W=[UiV], M=(M,!mM,), 

(38) 

Me ane Ga) FAG nx men ly =O Le Me 


T= Ce 
Such models were first investigated in detail by Anderson (1951a).
The relations between such regression models and LIFU were overlooked
for a long time, so that results already derived for the former had to be
rediscovered for various special LIFU+.
The form of (38) corresponding to (35) is


2 = (UO @ 13) Mayr + V © La) bene + Gin) (39) 
fae lL, LCenaees i= 1,...,m, Hj € IR, 7 =1,..., Mp. 


There are close connections between LIFU* with a linear regression part 
and linear simultaneous equations. Let v; denote exogenous variables and let 


0= L1%; + fn; + 0, EEE ENG Gs LE My xa, (40) 


be those equations of a complete linear simultaneous equations model for 
which the coefficients of a second vector of exogeneous variables «; and of 


further endogenous variables are known to be zero (cf. Anderson, 1951a, eqn (6.6)).
The aim is to utilize this prior information to improve the estimates of the
remaining smaller number of parameters, hence the name ‘limited information
maximum likelihood’ (cf. Section 3.3.1.5). In the reduced form, however, some
parameter matrix $M_1$ of $u_i$ will occur. But since $u_i$ does not occur in (40) we
must have $L'M_1 = 0$. Therefore the starting point for further investigations is
a subsystem of the complete reduced form of the linear simultaneous equations
model:
$$z_i = M_1 u_i + M_2 v_i + \zeta_i, \qquad L'M_1 = 0. \tag{41}$$
(The questions of identifiability and inference that are related to the original
model and to the use of the reduced form are not treated further here; see
Anderson, 1951a, pp. 344-345.) But the subsystem of the reduced form (41) yields
nothing but a LIFU+ with a linear regression part according to (38), namely
with $U = (u_1, \ldots, u_n)$, $V = (v_1, \ldots, v_n)$.
In the econometric literature one often considers explicit linear simultaneous
equations in which $L$ is parametrized in the following form:


Lt = (Bi = I); Be Mip-a) xa (42) 


This representation provides the relation between known estimators for linear
simultaneous equations and LIFU+ (cf. Example 3.2.2 in Section 3.2.5).


3.1.3.4 General models with errors-in-variables 


Starting from functional relations we obtain a more general class of interesting
models if we suppose that not all but only some of the variables are observed
with errors. The state vector $v_i$ can be divided into one part $\mu_i$, which is
observed with errors, and a part $w_i$, which is observed without errors:
$$v_i = [w_i, \mu_i].$$
The $w_i$ can also be denoted as regressors. As in functional relations, it shall
be possible that the system function for the $i$th state depends on the index $i$,
too.

Thus, let
$$St_{i\pi} = \{v_i \in \mathbb{R}^d \mid 0 = s_i(v_i, \pi)\} \tag{43}$$
be the state manifold for the $i$th state.
In functional relations we have $St_{i\pi} = St_{1\pi} =: St_\pi$, $i = 1, \ldots, n$. Simplifying,
we put
$$s_i(w_i, \mu_i, \pi) =: s_i(\mu_i, \pi), \tag{44}$$
if the (exactly known) regressors $w_i$ play no role in the current considerations.
The common state manifold for all $n$ observed states has the product form
$$S_\pi = \mathop{\times}\limits_{i=1}^{n} St_{i\pi} = \{v_{(n)} \mid 0 = s_i(v_i, \pi),\ i = 1, \ldots, n\}, \tag{45}$$
which is typical of errors-in-variables models.
The state vector $v_{(n)}$ and the system parameter $\pi$ vary in the set
$$S = \{[v_{(n)}, \pi] \mid s(v_{(n)}, \pi) = 0,\ \pi \in \Pi\} = \bigcup_{\pi \in \Pi} \{\pi\} \times S_\pi. \tag{46}$$
This set can be denoted as the system manifold.

Here $s = s_{(n)} = (s_i)_{i=1,\ldots,n}$ is the system function. With fixed $w_{(n)}$ and $\pi$, the
unknown incidental parameters may vary in the $(w, \pi)$-section of $S$:
$$S_{w\pi} = \{\mu_{(n)} \mid 0 = s_i(w_i, \mu_i, \pi),\ i = 1, \ldots, n\}. \tag{47}$$
While in functional relations the structural bundle
$$St_\Pi = \{St_\pi\}, \qquad St_\pi = \{v \in \mathbb{R}^d \mid 0 = s_0(v, \pi)\}$$
suffices to characterize the model, we have to use the following structural bundle
in general models:
$$S_\Pi = \{S_\pi\}. \tag{48}$$


Mostly we consider explicit models. Then the state equation for the $i$th
state has the form
$$\eta_i = r_i(w_i, \xi_i, \pi) \tag{49}$$
with $\eta_i \in \mathbb{R}^c$. In this case the variable vectors $w_i$, $\xi_i$ may be denoted as
independent or exogenous. The functions $r_i$ are regression functions in a generalized
sense. Here, too, we simplify $r_i(w_i, \xi_i, \pi) =: r_i(\xi_i, \pi)$ if the known $w_i$ are of no
importance. We want to emphasize that the variables observed without errors
can also be covered by means of the distribution model for the observation
errors. Then we have to permit singular covariance matrices. This is extensively
developed for linear models in Section 3.4.


3.1.3.5 Regression models 


In the context of the general errors-in-variables models, regression models
can be obtained in a natural way. Regression models are those errors-in-variables
models in which the variables $\mu_i$ observed with errors are explicit functions
of the regressors $w_i$ observed without errors:
$$\mu_i\ (= \eta_i) = r_i(w_i, \pi). \tag{50}$$
Thus, the system functions $s_i$ have the form
$$s_i(w_i, \mu_i, \pi) = -\mu_i + r_i(w_i, \pi). \tag{51}$$


We also obtain regression models if we start from an explicit functional
relation and take a special degenerate distribution in the distribution model for
the observation errors. For this purpose we start from an explicit relation
$$\eta_i = r_i(\xi_i, \pi), \tag{52}$$
$$\mu_i = [\xi_i, \eta_i], \qquad z_i = [x_i, y_i], \qquad \zeta_i = [\delta_i, \varepsilon_i].$$
In the distribution model for the observation errors we suppose $\delta_i = 0$. Then
the $x_i = \xi_i$ are the regressors, and we get the well-known representation
$$y_i = r_i(x_i, \pi) + \varepsilon_i, \qquad i = 1, \ldots, n. \tag{53}$$
Conversely, we can also regard the functional relations as regression models
with errors in the regressors.

Concluding this section, we want to discuss some differences in the application
of regression models and more general models. Obviously regression models
can be used to predict the ‘dependent’ observations $y_i$ on the
basis of the ‘independent’ variables $w_i$ and $x_i$, even if the underlying ‘correct’
observation model includes errors in the variables. In such a case one wants
to know the conditional expectation $E(y \mid x, w)$, which yields the best prediction
of $y$ by $w$ and $x$ under quadratic loss (cf. Rao 1973, 4.g.1). For this, regression
models of the form
$$y_i = \tilde r(\kappa, x_i, w_i) + \tilde\varepsilon_i \tag{54}$$
are suitable. Then, by $\tilde r(\hat\kappa, \cdot)$, an estimate $\hat\kappa$ also yields an estimate of the
conditional expectation. In contrast to this, the general errors-in-variables
models serve to model the system behaviour, that is, to predict the
‘dependent’ state variables $\eta_i$ after fixing the ‘independent’ $\xi_i$.

This difference can be made clear by an example. A safety
inspector supervising the erection of a building might be interested in knowing
which value the true shearing stress $\eta$ attains, given that a surface load $x$ was
observed. He would apply a predictive regression model. But for an engineer
designing a building it is more important to know what the true shearing
stress $\eta$ would be if the true surface load $\xi$ attains a specific value. He would
apply an errors-in-variables model.
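
The two questions lead to different slopes even in the simplest linear case. The following simulation is a minimal sketch under assumptions that are not part of the original text: a linear relation $\eta = \beta\xi$, independent normal $\xi$, $\delta$, $\varepsilon$ with the variances chosen below, and a hypothetical helper `ols_slope`. Under these assumptions the predictive slope of $y$ on the observed $x$ is $\beta\,\mathrm{var}(\xi)/(\mathrm{var}(\xi) + \mathrm{var}(\delta))$, whereas the structural slope is $\beta$.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the sketch is reproducible
beta = 2.0              # structural slope (assumed for illustration)
var_xi, var_delta, var_eps = 1.0, 1.0, 0.25
n = 20000

xi = [rng.gauss(0.0, var_xi ** 0.5) for _ in range(n)]            # true loads
x = [t + rng.gauss(0.0, var_delta ** 0.5) for t in xi]            # observed loads
y = [beta * t + rng.gauss(0.0, var_eps ** 0.5) for t in xi]       # observed stresses

# Inspector's question: slope of the best predictor of y from the observed x.
predictive_slope = ols_slope(x, y)
# Engineer's question: slope with respect to the true xi
# (computable here only because the simulation knows xi).
structural_slope = ols_slope(xi, y)
# Theoretical value of the predictive slope under the stated assumptions.
expected_predictive = beta * var_xi / (var_xi + var_delta)
```

With these settings the predictive slope is close to 1 while the structural slope is close to 2: both are correct answers, but to different questions.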

Moreover, regression models are suitable if the error of the measurement
of the ‘independent’ state variable $\xi_i$ is small compared with the error in the
‘dependent’ state variables. They are also suitable if the equation error is
large in comparison with the error in the measurement of the independent state
variables (cf. Section 2.1.6). This is, for instance, the case in the analysis of variance.
Here the regressors $w_i$ are exactly measurable 0-1 vectors and $\varepsilon_i$ then includes
the comparatively large model error. It has been known for a long time that
in a linear system model where the state surfaces are linear subspaces, the
corresponding regression function for the observations need not be linear
(cf. Kendall and Stuart, 1961, 29.56-59).


3.1.3.6 Functional relations with random experimental design 


The models considered so far have been based on a deterministic model of
the form $0 = s(\mu_i, \pi)$, where fixed design points $\mu_i \in St_\pi$ were given. Among
other things, it is characteristic of such models that the number of unknown
parameters $\mu_i$, $i = 1, \ldots, n$, increases with the number of observations. Since
the distribution of each single observation $z_{i_0}$ contains only the parameter
$\mu_{i_0}$ among all the $\mu_i$, $i \ne i_0$, the $\mu_i$ are called incidental parameters (cf.
Definition 3.1.3).

A totally different situation arises if the experimental design is randomly
distributed on $St_\pi$. The $\mu_i$ are often supposed to be independent
and identically distributed, and then the number of parameters does not increase
with the number of observations. But dependences have been admitted recently
(cf. Robinson, 1977). Here it holds for the corresponding sequence of models that
the dimension of the parameters of interest does not increase, although the dimension
of the incidental parameter may indeed increase:


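Definition 3.1.6 Explicit functional relations with random incidental parameters
are defined by
$$z_i = \mu_i + \zeta_i, \qquad \mu_i = [\xi_i, \eta_i], \qquad \eta_i = r_0(\xi_i, \pi) \tag{55}$$
with a distributional model for the stochastic variables
$$\omega_{(n)} := [\xi_{(n)}, \zeta_{(n)}] \sim \mathcal{P}^\omega = \{P_\kappa \mid \kappa \in K\}, \tag{56}$$
where the stochastic independence of $\xi_{(n)}$ and $\zeta_{(n)}$ is mostly demanded:
$$P_\kappa = P_\theta \otimes P_\gamma, \qquad \kappa = [\theta, \gamma] \in \Theta \times \Gamma. \tag{57}$$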


Of course this definition can also be applied to implicit models. 
Then we have to demand additionally in the distributional model that 
Px(0 = 8(2, 2, U, $)) = 1 holds for y = [2,u,$]. a relation that is always 
satisfied in the described explicit functional relations. This model can also
be written in a more compact form corresponding to Definition 3.1.4:


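$$z_{(n)} = \mu_{(n)} + \zeta_{(n)}, \qquad \mu_{(n)} = [\xi_i, \eta_i]_{i=1,\ldots,n}, \qquad z_{(n)} = [x_i, y_i]_{i=1,\ldots,n},$$
$$\eta_{(n)} = r_{(n)}(\xi_{(n)}, \pi),$$
$$\xi_{(n)} \sim \mathcal{P}^\xi = \{P^\xi_\theta \mid \theta \in \Theta\}, \qquad \zeta_{(n)} \sim \mathcal{P}^\zeta = \{P^\zeta_\gamma \mid \gamma \in \Gamma\}. \tag{58}$$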
3.1.4  Identifiability


Now we turn to fundamental questions of identifiability. We investigate
only functional relations in which all state variables are observed with
errors. Corresponding to the possible distributional assumptions, we have to
distinguish between models with random and those with nonrandom incidental
parameters. Both have the same state equations; hence one could expect
that both models have similar properties with respect to identifiability. But
the different distributional assumptions result in totally different
situations with respect to the distribution of the observation $z_{(n)}$.

For a random experimental design $\mu_{(n)}$ we have


Pe = pert, (59) 


Here the structural parameter $\pi$ influences the distribution of the observations
only indirectly, via the distribution


Pte= Pee (60) 


But with the exception of some special cases the distribution of $z = \mu + \zeta$
can be rather difficult to obtain.

We have a completely different situation with a nonrandom experimental
design $\mu_{(n)}$. Here


Pe = Pett — px, (61) 


As $\mu$ is fixed now, the distribution of $z = \mu + \zeta$ is obtained without difficulty
from that of $\zeta$ by shifting the mean value. The structural parameter $\pi$ does
not occur directly in the parameters of the observational distribution $P^{z_{(n)}}$.
Other problems arise with these models because the number of unknown
parameters increases with the number of observations.

This problem is impressively demonstrated by the failure of the maximum
likelihood method, which initiated a great deal of discussion.

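For this purpose we consider the simplest LIFU+ model given by (1), (2), with
$$P^\zeta_\gamma := N(0, \Sigma(\gamma)), \qquad \Sigma(\gamma) = \begin{pmatrix} \sigma_\delta & 0 \\ 0 & \sigma_\varepsilon \end{pmatrix}, \qquad \gamma = [\sigma_\delta, \sigma_\varepsilon] \in \mathbb{R}_+ \times \mathbb{R}_+,$$
and stochastically independent $\zeta_i$; here $\sigma_\delta$ and $\sigma_\varepsilon$ denote the error variances.
Then
$$\psi_{(n)} = [\alpha, \beta, \xi_1, \ldots, \xi_n, \sigma_\delta, \sigma_\varepsilon] \tag{62}$$
and the likelihood function is
$$\ell(\psi_{(n)}) = (2\pi)^{-n} (\sigma_\delta \sigma_\varepsilon)^{-n/2} \exp\Bigl\{ -\frac{1}{2\sigma_\delta} \sum_{i=1}^{n} (x_i - \xi_i)^2 - \frac{1}{2\sigma_\varepsilon} \sum_{i=1}^{n} (y_i - \alpha - \beta\xi_i)^2 \Bigr\}. \tag{63}$$
It can easily be shown (cf. Lindley, 1947) that the relation
$$\hat\sigma_\varepsilon = \hat\beta^2 \hat\sigma_\delta \tag{64}$$
follows for the MLE.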

The consistency we know from other statistical models would only result
if $\sigma_\varepsilon = \beta^2\sigma_\delta$. This failure of the likelihood method in LIFU was discussed in
detail in the literature (see Section 3.1.6). Hence we could conclude that
models with a fixed experimental design can be applied only with
difficulty, and that consequently we should use only models with random
experimental design. But the distribution of $z$ can be given explicitly only in a few
cases, among others for normally distributed $\mu_i$ and $\zeta_i$, which, indeed, include
most of the univariate random quantities in practical cases.

But precisely this case causes fundamental difficulties for the identifiability
in LIFU-. The following fundamental theorem goes back to Reiersol (1950a).


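Theorem 3.1.1 Let the simple LIFU- model be given according to (5), (6).
Then $\beta$ is identifiable if and only if at least one component of $\mu_i$ is not normally
distributed.

Proof. The characteristic function of $\zeta_i$ is
$$\varphi_\zeta(t) = \exp\{-t'\Sigma t/2\}, \qquad t \in \mathbb{R}^2, \tag{65}$$
and for the $\mu_i$ it holds that
$$\varphi_\mu(t) = \exp(i\alpha t_2)\, \varphi_\xi(t_1 + \beta t_2), \tag{66}$$
with $i$ as the imaginary unit. For $z_i$ it thus follows that
$$\varphi_z(t) = \exp(i\alpha t_2 - t'\Sigma t/2)\, \varphi_\xi(t_1 + \beta t_2).$$
Now let
$$\tilde\psi = [\tilde\alpha, \tilde\beta, \tilde\theta, \tilde\Sigma] \ne \psi,$$
but $P^z_{\tilde\psi} = P^z_\psi$. With this we would also get $\tilde\varphi_z = \varphi_z$, i.e.
$$\exp(i\alpha t_2 - t'\Sigma t/2)\, \varphi_\xi(t_1 + \beta t_2) = \exp(i\tilde\alpha t_2 - t'\tilde\Sigma t/2)\, \tilde\varphi_\xi(t_1 + \tilde\beta t_2). \tag{67}$$
For $\tilde\beta \ne \beta$ we could find numbers $t_1$, $t_2$ such that
$$t_1 + \beta t_2 = s, \qquad t_1 + \tilde\beta t_2 = 0 \tag{68}$$
holds for arbitrarily given $s \in \mathbb{R}$. Hence
$$t_1 = -\tilde\beta s/(\beta - \tilde\beta), \qquad t_2 = s/(\beta - \tilde\beta).$$
Inserting in (67) it follows that
$$\varphi_\xi(s) = \exp(i a_1 s + a_2 s^2)$$
with
$$a_1 := (\tilde\alpha - \alpha)/(\beta - \tilde\beta), \qquad a_2 := [-\tilde\beta, 1]\,(\Sigma - \tilde\Sigma)\,[-\tilde\beta, 1]' / \bigl(2(\beta - \tilde\beta)^2\bigr).$$
Consequently $\xi_i$ is normally distributed or a constant, by a known
characterization theorem (cf. Kagan, Linnik, and Rao, 1973). This can be shown
analogously for $\eta_i$. □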
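
The nonidentifiability in the normal case can be checked numerically. The sketch below is an illustration that is not part of the original proof. It assumes independent normal $\xi$, $\delta$, $\varepsilon$ and uses the fact that a bivariate normal distribution is determined by its mean vector and covariance matrix. The function names and the two parameter sets are hypothetical choices for this illustration.

```python
import math


def observation_moments(alpha, beta, mean_xi, var_xi, var_delta, var_eps):
    """Mean vector and covariance matrix of z = [x, y] with
    x = xi + delta and y = alpha + beta*xi + eps, where xi, delta, eps
    are independent (an assumption of this sketch)."""
    mean = (mean_xi, alpha + beta * mean_xi)
    cov = (
        (var_xi + var_delta, beta * var_xi),
        (beta * var_xi, beta ** 2 * var_xi + var_eps),
    )
    return mean, cov


def same_moments(m1, m2, tol=1e-12):
    """True if two (mean, covariance) pairs agree up to rounding."""
    flat1 = list(m1[0]) + [c for row in m1[1] for c in row]
    flat2 = list(m2[0]) + [c for row in m2[1] for c in row]
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(flat1, flat2))


# Two parameter sets with different slopes beta = 1.0 and beta = 0.8 ...
first = observation_moments(0.0, 1.0, 0.0, 2.0, 1.0, 1.0)
second = observation_moments(0.0, 0.8, 0.0, 2.5, 0.5, 1.4)

# ... that produce the same normal distribution of the observations.
indistinguishable = same_moments(first, second)
```

Both parameter sets give mean $(0, 0)$ and covariance matrix with entries 3, 2, 2, 3, so no amount of data from this normal model can separate the slopes 1.0 and 0.8.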
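Corollary 3.1.1 $\alpha$ is always identifiable if $\beta$ is identifiable.


Proof. Let $\tilde\beta = \beta$ and $P^z_\psi = P^z_{\tilde\psi}$. Then we have $t_1 = s - \beta t_2$ in (68). Having
inserted this in (67) and having taken the logarithm, it results that
$$\ln\bigl(\varphi_\xi(s)/\tilde\varphi_\xi(s)\bigr) = i(\tilde\alpha - \alpha)t_2 - \frac{1}{2}\,(s, t_2) \begin{pmatrix} 1 & 0 \\ -\beta & 1 \end{pmatrix} (\tilde\Sigma - \Sigma) \begin{pmatrix} 1 & -\beta \\ 0 & 1 \end{pmatrix} \begin{pmatrix} s \\ t_2 \end{pmatrix}. \tag{69}$$
As this equation holds identically in $s$ and $t_2$, the coefficients of the monomials
$st_2$, $t_2$, $t_2^2$ on the right-hand side have to vanish. This yields
$$\tilde\alpha = \alpha, \tag{70}$$
$$\sigma_{\delta\varepsilon} - \tilde\sigma_{\delta\varepsilon} = \beta(\sigma_\delta - \tilde\sigma_\delta),$$
$$\sigma_\varepsilon - \tilde\sigma_\varepsilon = \beta(\sigma_{\delta\varepsilon} - \tilde\sigma_{\delta\varepsilon}) = \beta^2(\sigma_\delta - \tilde\sigma_\delta). \tag{71}$$
(Similar conclusions follow from the consideration of $\varphi_\eta$, of course.) The
equation (70) means the identifiability of $\alpha$.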


It is interesting that the remaining components of $\psi$ do not necessarily have
to be identifiable. The next theorem shows that, at least in the bivariate
LIFU-, they are identifiable in principle only for regression models, i.e. if one
of the variables is observed without errors.


Theorem 3.1.2 Let $\beta$ be identifiable. Then the other components of $\psi = [\alpha, \beta, \theta, \Sigma]$
are identifiable if and only if the following conditions are fulfilled:


1. Either $\delta_i = 0$ or $\varepsilon_i = 0$.
2. Neither $\xi_i$ nor $\eta_i$ has a distribution that is divisible by a normal distribution.

Proof. If $\beta$ is identifiable, so is $\alpha$ (Corollary 3.1.1). Then $[\theta, \Sigma]$ is not identifiable
if there exists a $\tilde\psi \ne \psi$ with $P^z_{\tilde\psi} = P^z_\psi$, where $\tilde\alpha = \alpha$, $\tilde\beta = \beta$. This is equivalent
to the validity of (71) and (67). Let us now insert (71) in (67). With this, the
nonidentifiability of the remainder of $\psi$ is equivalent to the existence of pairs
$[\tilde\theta, \tilde\Sigma] \ne [\theta, \Sigma]$ which satisfy (71) and
$$\varphi_\xi(s) = \tilde\varphi_\xi(s) \exp\{-(\tilde\sigma_\delta - \sigma_\delta)s^2/2\} \tag{72}$$
at the same time. (Divisibility of the distribution of a random quantity $v$ by
that of $u$ means that there exists a random quantity $w$ with $\varphi_v = \varphi_u\varphi_w$.)

Now the identifiability follows from the conditions 1 and 2: if (71)
and (72) were satisfied at the same time, we would have $\tilde\varphi_\xi(s) \exp\{-\tilde\sigma_\delta s^2/2\}
= \varphi_\xi(s)$ because of $\sigma_\delta = 0$, and this would contradict condition 2. (The case
$\sigma_\varepsilon = 0$ can be treated with the equation for $\varphi_\eta$ corresponding to (72).)

Conversely, suppose one of the conditions is not satisfied. If condition 1 is
not fulfilled, we choose $\tilde\sigma_\delta$ with $\sigma_\delta > \tilde\sigma_\delta > 0$ and $\sigma_\varepsilon - \beta^2(\sigma_\delta - \tilde\sigma_\delta) > 0$. Then (72)
and (71) are fulfilled. If condition 2 is not satisfied, it holds e.g. for $\xi_i$
that
$$\varphi_\xi(s) = \varphi_u(s) \exp(ims) \exp(-\sigma s^2/2).$$
Then we choose $\tilde\sigma_\delta > 0$ sufficiently small so that $\sigma + (\sigma_\delta - \tilde\sigma_\delta) > 0$ and
$\tilde\sigma_\varepsilon := \sigma_\varepsilon - \beta^2(\sigma_\delta - \tilde\sigma_\delta) > 0$ hold. The function $\tilde\varphi_\xi$, determined according to
(72), is a characteristic function and (71) is fulfilled at the same time.


Theorems of this kind were also proved by Reiersol (1950a) for LIFU- with
independent error components $[\delta_i, \varepsilon_i]$ which do not necessarily have to be
normally distributed. Some statements concerning this can also be found in
Lukacs and Laha (1964, ch. 6.3). By means of general characterization
theorems we obtain theorems for the multivariate case, too. A detailed analysis
is given by Rao (1966) and, with other methods, by Jeeves (1954).

Of course, such statements cannot be used for checking existing practical
models with respect to their identifiability. Although the normal distribution
provides a sufficient approximation for many practical purposes, it is always
only an approximation, because the supports of the distributions
in practical cases are always finite. Thus identifiability is theoretically
secured in practice. But the principal importance of such identifiability
theorems lies in the fact that, although we can theoretically expect identifiability,
it is so ‘weak’ that the sample size necessary for practical identification
cannot be realized. A way out is the replication of the experiments
or the use of other additional information.

Compared with this, with nonrandom incidental parameters we always have the
identifiability of $\mu_{(n)}$ and $\gamma$. From $P^z_\psi = P^z_{\tilde\psi}$ we obtain
$$\mu_{(n)} = E_\psi z_{(n)} = E_{\tilde\psi} z_{(n)} = \tilde\mu_{(n)}. \tag{73}$$




And $\gamma$ remains identifiable in the family $\mathcal{P}^z$ if it was identifiable in $\mathcal{P}^\zeta$. Then
$\mu_{(n)} = \tilde\mu_{(n)}$ also yields


High ese ed (74) 
and equivalently 
Pi = FF (75) 


ye 
hence 


Baltes oYbxc (76) 


There still remains the question of the identifiability of the structural
parameter, which does not always have to be secured, as the following simple
example will show. Let us take the straight lines in $\mathbb{R}^2$ as structural bundle.
According to the previous considerations, $\mu_{(n)}$ is uniquely determined by
$P^z$, but for $n = 1$ this does not imply the unique determination of a
corresponding straight line.

In this example two design points are already sufficient for the property
that almost all experimental designs on the true structure are identifying for
the straight-line parameter, as is intuitively obvious. But for more general
nonlinear models the problems arising here demand more detailed
investigations, which will be reported on briefly at the end of the section. For linear
models we can prove the following identifiability theorem by elementary
means.
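
The straight-line example can be made concrete with a small computation. The sketch below is an illustration under assumptions of our own, not the authors' construction: a line in the plane is written as $a_1\xi + a_2\eta = c$, so each design point $(\xi_i, \eta_i)$ imposes one linear condition on $(a_1, a_2, c)$, and the design is identifying exactly when these conditions determine $(a_1, a_2, c)$ up to a scale factor. The helper names `rank` and `is_identifying` are hypothetical.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a small matrix by Gaussian elimination in exact arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    rk = 0
    for col in range(len(m[0])):
        # Look for a pivot in this column among the rows not yet used.
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            factor = m[i][col] / m[rk][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk


def is_identifying(design):
    """True if exactly one straight line contains all design points.

    Each point (xi, eta) gives the condition a1*xi + a2*eta - c = 0;
    the line is unique iff the solution space for (a1, a2, c) has
    dimension one.
    """
    rows = [(xi, eta, -1) for xi, eta in design]
    return 3 - rank(rows) == 1


one_point = is_identifying([(1, 2)])                      # n = 1
two_points = is_identifying([(1, 2), (3, 4)])             # two distinct points
repeated_point = is_identifying([(1, 2), (1, 2)])         # degenerate design
three_collinear = is_identifying([(0, 1), (1, 3), (2, 5)])
```

A single point, or the same point observed twice, lies on infinitely many lines, whereas two distinct points already fix the line; coinciding design points form the exceptional null set.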


Theorem 3.1.3 For LIFU+ with a linear regression part the linear subspace
$\mathcal{L} = \mathcal{R}(L)$ is identifiable iff $r[W] = n$.


Proof. 7[UPy’1] = 1, holds for r[W] = n. Hence 
R(E(Z(L = Py'1))) = R(M,UPy 1) = R(M,) = £. 


Conversely let r[ W] < n. Then there exists a vector § € R" withE’W = 0€Myy. 
Now we choose a & € M?>%,),, with & as the first row. We get EW = [0, é], 
EE My—q—yxe Let L = (Ui £) € Meh, 1 € IR®. We choose L® = (i | L) 
with t, | A(L®) =: £, Then the relation 


HM) = £, + RM) = 4, 


holds for M® := L£, where in M{" is the block of M® = [M™ i mM] 
belonging to U in W = [U1 V]. But we have 


Mow = MOW = LE. 


Consequently P*: = P** does not yield £, = f, and thus £ = R(M,) is 
not identifiable. 
It should be taken into account that this does not answer the question which
and ‘how many’ experimental designs on the true subspace identify this
subspace. The theorem only states the following: if an experimental design
is contained in a subspace from $\mathcal{L}_{d-c}$, then we can indeed identify this subspace from the
observations. However, for other asymptotic investigations we need a
description of such models where one fixed experimental design cannot be contained
in two different structures at the same time. Details may be found in Höschel
(1978b, 1986).


Definition 3.1.7 Let a fixed structure $S_\pi$ from the bundle $S_\Pi$ be given. An
experimental design $v_{(n)}$ on $S_\pi$ is called identifying for $S_\pi$ (with respect to the
bundle $S_\Pi$) if $v_{(n)}$ is not contained in any other structure from $S_\Pi$.
(Of course we assume that exactly one parameter corresponds to each
structure.)

In this case
$$\Pi(v_{(n)}) := \{\pi \mid [v_{(n)}, \pi] \in S\}$$
is a one-element set. In particular, this is not the case if $\Pi(v)$ is empty. Then $v$ does
not lie on any state manifold of the model and consequently is not
identifying.


Definition 3.1.8 An explicit model with the system functions $\eta_i = r_i(w_i, \xi_i, \pi)$
for the observed states is called internally (structure) identifiable if, for all $\pi \in \Pi$
and almost all $\xi_{(n)}$, $w_{(n)}$, the experimental designs $v_{(n)} = ([w_i, \mu_i])_{(n)}$ induced by
means of the system function on the state manifold $S_\pi$ are identifying.


Theorem 3.1.4 (Identifiability theorem for nonlinear models) We obtain
internal identifiability if:

1. there exist sufficiently many experimental design points, namely $n > \dim \Pi$;
and

2. the corresponding state manifold $S_\pi$ has no weak contact of infinite order for
different structural parameters (cf. Höschel, 1986).


We notice that most of the functions that are practically applied in curve
fitting satisfy this second assumption: polynomials, exponential and
trigonometric functions. For regression models we obtain the following result.


Theorem 3.1.5 In a regression model almost all experimental designs $w_{(n)}$
are identifying if:


1. $n > \dim \Pi$; and
2. different state manifolds do not have a weak contact of infinite order. In this
case the state manifolds are given by
$$S_\pi = \{v_{(n)} = ([w_i, \mu_i])_{i=1,\ldots,n} \mid \mu_i = r(w_i, \pi),\ i = 1, \ldots, n\}.$$

An illustration results if we recall that
$$St_\pi = \{v = [w, \mu] \mid \mu = r(w, \pi)\}$$
is the graph of the related regression function in the state space $\mathbb{R}^d$ for fixed $\pi$.

To prove some fundamental properties of estimators in errors-in-variables
models, especially consistency, we need the identifiability just treated
not only for almost all but for all states. We can always achieve this
by an inessential restriction of the parameter region, provided the assumptions
of the identifiability theorem are fulfilled. For this purpose we only have to
omit, for each $\pi \in \Pi$, the null set of those independent state vectors $w_i$, $\xi_i$,
$i = 1, \ldots, n$, that provide nonidentifiability. Over this parameter region
all parameters are identifiable from the observation distribution, so
that $P^z_{\psi_0} = P^z_{\psi_1}$ yields $\psi_0 = \psi_1$: we get $\mu_{(n)1} = \mu_{(n)0}$ and $\gamma_1 = \gamma_0$ as in
(74), (76). Finally, since $\mu_{(n)0}$ is among the identifying states on the state
manifold $S_{\pi_0}$ because of the restriction of the parameter region, $\pi_1 = \pi_0$ also
follows.


3.1.5 On the existence of consistent estimators 
under a nonrandom experimental design 


In the last section it was argued that a random design can cause nonidentifiability,
and inconveniently just for the normal distribution, which has proved its
practical value in many other applications. On the other hand, for design
distributions other than the normal distribution, the distribution of the
observations cannot be obtained easily. We do not have these disadvantages for
nonrandom designs: the distributions of the observations can easily be described and the
parameters are practically always identifiable. This is the reason one would
be tempted to use exclusively nonrandom designs for modelling. However,
the ‘failure’ of the MLE, described in Section 3.1.4, indicates that we have to
construct other estimators for certain models to obtain consistency. In this
section it will be shown that for a large class of models with nonrandom design,
consistent estimators cannot exist. Consistency is a property of convergence
for infinite sequences of observations. With an increasing number of incidental
parameters we also have to take into account infinite sequences of parameters.
With a nonrandom experimental design the distribution of $z_{(\infty)}$ is induced
by a distribution assumption for $\zeta_{(\infty)}$:
$$P^{z_{(\infty)}}_\psi = P^{\mu_{(\infty)} + \zeta_{(\infty)}}_\gamma, \qquad \psi = [\mu_{(\infty)}, \pi, \gamma].$$
Here $\mu_{(\infty)}$ satisfies the equations $0 = s_i(w_i, \mu_i, \pi)$.
With random experimental design the distribution of $\mu_{(\infty)}$ is described by a
parameter $\theta \in \Theta$: $\mu_{(\infty)} \sim P_{\theta,\pi}$, where the structures $\mathop{\times}\limits_{i=1}^{\infty} St_{i\pi}$ have to be the
supports of the distributions $P_{\theta,\pi}$. Hence, for all $i = 1, 2, \ldots$, it has to hold almost
surely that $0 = s_i(w_i, \mu_i, \pi)$. With explicit models this holds from the very
beginning, since the $\mu_i$ may have an arbitrary distribution which is induced by
a distribution of the $\xi_i$. In case $\mu_i$ and $\zeta_i$ are stochastically independent, the
distribution of the observations $z_{(n)}$ is uniquely determined by the parameter
$\psi = [\theta, \pi, \gamma]$.


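Definition 3.1.9 In an errors-in-variables model an estimator $\hat\pi = (\hat\pi_n)$
that is defined for all natural numbers $n$ is called consistent (respectively strongly
consistent) for $\pi$ if
$$\hat\pi_n(z_{(n)}) \xrightarrow{P} \pi \quad (\text{resp. } \hat\pi_n(z_{(n)}) \to \pi \text{ almost surely}) \qquad \text{as } n \to \infty$$
holds for each parameter $\psi = [\mu_{(\infty)}, \pi, \gamma] \in \Psi$.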


(In general errors-in-variables models the estimator $\hat\pi$ also depends on the
exactly observed regressors $w_{(n)}$.)

Now we call a model with random experimental design as belonging to a
model with nonrandom experimental design if both models have the same
state equation. If, in a model with random experimental design, $\mu_{(n)}$ is
regarded as a realization of the random design, we can formally establish the
related model with nonrandom experimental design. Then we can give a
conditional estimator $\hat\pi(z_{(n)} \mid \mu_{(n)})$, from the distribution of which we can obtain
the unconditional estimator of the original model, at least in principle. For
instance, let $\mu$ and $\zeta$ be independent; then
$$P^z = \int P^{\mu + \zeta \mid \mu}\, \mathrm{d}P^\mu(\mu),$$
where $P^{\mu + \zeta \mid \mu}$ is the conditional distribution of $z$ for fixed $\mu$. Such a
procedure corresponds to a conditional inference in errors-in-variables models
(Madansky, 1959, p. 175; Kendall, 1951, p. 17; Malinvaud, 1966, p. 378).
The following result, which demonstrates the close connection between both
models, is valid.

Theorem 3.1.6 In a model with random experimental design let the structural
parameter be nonidentifiable. Then it cannot be consistently estimated in the
corresponding model with nonrandom experimental design.

Proof. Suppose # is a consistent estimator in the model with nonrandom ex- 
perimental design. Then we have 


lim P iA | ny = Mays Scny = Sm} = 1 


with the set | 
n—oo 


$\psi$ contains the parameters for the distributions of $[\mu_{(\infty)}, \zeta_{(\infty)}]$ and the structural
parameter $\pi$.
By a general zero-one law (cf. Loève, 1978, 32.4.A)
$$P(A) = 1,$$
hence $\hat\pi$ is consistent in the model with random experimental design. But this
contradicts the assumed nonidentifiability, because for at least one pair
$\pi \ne \tilde\pi$ it should hold that
$$P^z_\psi = P^z_{\tilde\psi}.$$
Because of the consistency it would follow that $\hat\pi \to \pi \ne \tilde\pi \leftarrow \hat\pi$.


The theorem shows that, without additional assumptions on the sequence
$\mu_{(\infty)}$, we cannot obtain consistent estimators in general, because without such
assumptions realizations from a model with a nonidentifiable structural
parameter could also arise. Nussbaum (1978a) described
this connection for linear functional relations.

We obtain identifiability, for example, if we carry out repeated observations
of a fixed experimental design. Above all, such models occur in the natural and
industrial sciences. This holds if the design points lie on a straight line and
$\|\mu_i\| < \|\mu_{i+1}\|$. This example is to be found in Ware (1972). But we cannot
consider each model with a fixed experimental design as belonging to one
with a random design. There are sufficiently many examples in practice in
which the experimental design cannot be considered as the realization of a
random variable. An example would be the periodic observation of a planet
after an exactly fixed time interval. Here the time error would be small in
comparison with all other errors in the model and therefore the observational
time may be considered as nonrandom or deterministic.


3.1.6 Bibliographic remarks 


The aim of this section is to ease orientation among the
great number of results available at present and to facilitate
access to the literature. Furthermore, it motivates the presentation in the
following sections compared with other monographs and indicates
open problems. The presentation is based on a bibliography in
which completeness was aimed at. Of course, the following selection is
subjective.

The basic classification is carried out according to nonrandom and random
experimental design, first for statistical estimators and then for the other
statistical inference methods. The first works on errors-in-variables models
concerned the bivariate LIFU+ model: Adcock (1877, 1878), Kummel (1879),
Pearson (1901). They treat the problem of the estimation of straight lines by
minimization of the orthogonal and the weighted squared distances between
observations and a straight line. This method of least squares, which provides
the MLE for normally distributed errors, has been developed to become the most
widely applied method, both theoretically and computationally. For some periods in the
1940s and 1950s, discussions dominated about the previously mentioned
fundamental questions of statistical inference in such models and, most of all,
about simpler estimation methods.

With the recent development of computer techniques, the computation of 
WLSE and related estimations has become possible also for nonlinear models, 
and the WLSE is correspondingly often applied and investigated. Uven (1930) 
considered LIFU* with normally distributed errors and a covariance known 
up to a factor. Koopmans (1932) proved this estimator to be an MLE, and he 
approximated the variance of the estimator for small error variances. For un- 
known covariance Dent (1935) gave the solution of the likelihood equations. 
Lindley (1947) showed that the estimators resulting from this fulfil an equation 
that is incompatible with consistency. This gave rise to detailed discussions 
about the application of MLE in bivariate LIFU+ with nonrandom design. 
Here the number of incidental parameters increases with the number of obser- 
vations. Following the fundamental works of Neyman and Scott (1948), Kiefer 
and Wolfowitz (1953) were able to show that, under certain assumptions on the 
distribution of the ~,) in the bivariate LIFU-, the likelihood method also 
provides consistency in the related LIFU*. Although the practical value of 
this method is very restricted, it showed that consistent estimability in 
LIFU* is possible in principle. But it turns out that the maximum likelihood 
solutions under unknown error covariance $\Sigma$ indeed yield only a saddle-point 
of the unbounded likelihood function. Anderson and Rubin (1956) proved the 
unboundedness for the model of linear factor analysis, which only represents 
a special LIFU. 

In the literature Solari (1969) has mostly been cited; she rediscovered this 
result and proved the MLS to be a saddle-point. Sprent (1970) derived some 
conclusions for the practical application of this method from this. Copas (1972) 
showed that the consideration of rounding errors in a certain form provides 
again a global maximum. Sprent (1976) discussed more general questions which 
arise from the likelihood method for LIFU*. But even for known error variance 
it has not been possible to prove the MLS as local extremum (Moberg and Sund- 
berg, 1978). The attempt to analyse the likelihood situation in more general 
LIFU is to be found in Florens, Mouchart, and Richard (1976) and Willassen 
(1979). 

Increasingly, multivariate LIFU, mainly LIFU*, with normally distributed 
errors have also been treated. But the difficulties that already became 
apparent when applying the MLE in bivariate models, and even more in 
multivariate ones, demanded the assumption of additional information 
about the sequence of the experimental designs or about the form of the error 


distribution §(n): 


16 Nonlinear Regression 
Naturally, the cases of repeated independent observations of a fixed 
experimental design play a special role here. For known error covariance Geary 
(1948) gave the MLE as eigenvectors of a certain eigenvalue problem, which 
previously had been found heuristically by Tintner (1945). For uncorrelated 
normally distributed errors ¢),...,$, with unknown covariance Anderson 
(1951a) gave the MLE for LIFU* with linear regression part. Because his 
paper has been titled ‘estimation of linear restrictions for regression coefficients’ 
(cf. (38), (39)) these results were overlooked for a long time in the works on 
LIFU. Various partial results in special cases were thus rediscovered (Acton, 
1959; Villegas, 1961; Nussbaum, 1976). The inconsistency of the MLE covariance 
estimations can be found in Villegas (1961) and Gleser and Watson (1973). 

Hannan (1967) provided a complete consideration of the relations between 
linear simultaneous equations and canonical correlations. A unified derivation 
of these results with fixed covariance and a covariance to be estimated is 
finally possible on this basis. Then Robinson (1973, 1974) gave a unified deri- 
vation of the WLSE and MLE for a general time series model comprising the 
mentioned models. 

For bivariate LIFU with normally distributed incidental parameters, the 
MLE was given for various models in a comparatively closed form (Cox, 1976; 
Dolby, 1976a; Brown, 1978a; Chan and Mak, 1979a, b). One of the needed 
minimization results is contained in Anderson (1951a) as a special case. Based 
on GLSE, which are MLE under normal error distribution, we will derive 
asymptotic properties, especially consistency and normality. Villegas (1966) 
was already able to prove asymptotic optimality properties of the GLSE. In Barnett (1970) 
a formula for the asymptotic variance in case of replications is to be found, which 
was corrected later by Patefield (1977). For the limited- information MLE (LIML) 
in linear simultaneous equations, which corresponds to the MLE in LIFU* with 
replications, Anderson and Rubin (1950) showed the asymptotic normality. 

But such properties were also shown for the case that there are no replications, 
in Schneeweiss (1976) for a corrected OLSE under suppositions on higher-order 
moments. For a large class of instrumental variables estimators, Nussbaum 
(1977, 1978) derived the asymptotic normality and proved the optimality 
of the GLSE. In the work of Robinson (1974), mentioned above, we can also 
find the idea of proof for asymptotic normality. 

A survey and detailed discussions of further asymptotic results can be found 
in Anderson (1976). There the distribution of the WLSE is approximated, if 
\l4c|? becomes large and ~ remains fixed. Thereby quite a number of known 
results on linear simultaneous equations could be transformed. Patefield (1976) 
compared the results in a Monte Carlo study. Based on the reduced form of 
linear simultaneous equations (cf. (38) and (41)), all results can be transferred 
mutually. The limited-information MLE in linear simultaneous equations 
corresponds to the MLE in LIFU*. For known error covariance, asymptotic 
expansions of the density of LIML are derived by Mariano (1969). 

For the two-stage LSE the densities were derived by Basmann (1961, 1963), 
Richardson (1968), Sawa (1969), and Sargan and Mikhail (1971). These 
densities resulted in the form of doubly infinite series of incomplete beta 
functions. The distribution of the two-stage LSE results as a double series of 
noncentral F-distributions (Anderson and Sawa, 1973). With estimated co- 
variance the density of the LIML was determined by Mariano and Sawa (1972). 
From this it resulted especially that also in LIFU for MLE there do not exist 
higher-order moments. This fundamental result stimulated modifications of 
the MLE for which higher-order moments exist. Fuller (1977) showed that these 
modifications, to the order $O(n^{-2})$, are better than a corresponding modification 
of the ‘k-class’ estimators. These estimators contain the two-stage LSE and 
the OLSE, and were originally introduced by Theil (1958). Nagar (1959) gave 
approximations for the moments of the approximating distributions. Fuller 
(1977) also computed the optimal modification parameter and was thus able 
to explain the relatively bad power of the MLE compared with the two-stage 
LSE known from Monte Carlo studies. The approximations obtained in this 
way are valid for increasing sample size. Other approximations are obtained 
for small error variance (Kadane, 1970, 1971). The relation to those given above 
for increasing sample size was shown by Anderson (1977), who also referred 
to the fact (see also Brown, Kadane, and Ramage, 1974) that in Nagar (1959) 
and Kadane (1970, 1971) the results were erroneously interpreted as approxi- 
mate moments of the distribution. In this sense Robertson (1974) and Williams 
(1973) indeed only gave the moments of the approximating distributions in 
LIFU*. The asymptotic normality for the modified estimators is treated in 
detail by Fuller (1977). For a more general linear simultaneous equation model 
with instrumental variables, Robinson (1974) gave an elegant derivation of the 
MLE and the asymptotic normality. The proof provides a unified approach 
to many models. Some algorithms for WLSE in LIFU are investigated in 
York (1967), Spathe (1967), and Williamson (1968). 

On another application of the least-squares principle in LIFU* there was 
published a much-discussed work by Sprent (1966). By heuristical considera- 
tions for LIFU* a kind of ‘GLSE’ was constructed here if ¢,,) has a general 
but known covariance 2 € M>,,. Dolby (1972) showed the equivalence of these 
‘GLSE’ with MLE (= WLSE) for normal errors as a special case in nonlinear 
functional relations. Another approach to the equivalence of WLSE and MLE 
in the normal case is investigated in Höschel (1978a). The equivariance is 
also shown there, and problems of identifiability are considered. 

Since 1960 nonlinear models and WLSE have been investigated more inten- 
sively. But the first general algorithms already appeared in Deming (1931, 
1943) and Oook (1931). The ideas for constructing WLSE by linearization of 
3(a,-) over the last iteration, described there, have remained of interest till 
today. But first, from 1960 on, the WLSE algorithms were developed for special 
nonlinear models: in Hey and Hey (1960) for hyperbolas, in Robinson (1961) and 
Chan (1965) for spheres. Clutton-Brock (1967) considered bivariate models and 
Griliches and Ringstad (1970) investigated the bias of the OLSE in quadratic 
models, O’Neill, Sinclair, and Smith (1969) treated general polynomial models. 
For general explicit functional relations Dolby (1972) gave the WLSE equations 
for known covariance of the ¢;, and Britt and Luecke (1973) determined them 
for general implicit models of the form (7) with known covariance § = §(n)- 
Later on Dolby (1976b) showed the equivalence of both methods, namely that 
of the ‘scoring’ method for explicit models and of the Lagrange multipliers 
for implicit ones. 

For unknown error covariances Dolby and Lipton (1972) investigated for 
bivariate functional relations the related Newton-Raphson algorithm. The 
asymptotic variance of the estimator for the case of replications is also to 
be found there. With a slightly modified dependence structure of the error 
covariance D(6,)) = A © 4, A known, this was further developed for 
multivariate models by Dolby and Freeman (1975). Besides these algorithms, 
statistical properties of the WLSE for nonlinear models could also be shown. 
Villegas (1969) gave the asymptotic variance of such an estimator in the case 
of replications and he showed the asymptotic optimality in a certain class. 
Fuller and Wolter (1982) demonstrated under weaker assumptions the asympto- 
tic normality of a WLSE for univariate explicit models and gave the asympto- 
tic variances. This is also valid for a special estimation in quadratic functional 
relations (Fuller and Wolter, 1977). The sequence yw; of the incidental para- 
meters has to satisfy certain weak assumptions. Modern numerical procedures 
for the computation of WLSE may be found in Southwell (1976), MacDonald 
and Powell (1972). Globally convergent algorithms on the basis of regularized 
Gauss-Newton procedures have been developed by Höschel and Penev (1980) 
and Tiller (1983). 

From this survey we can see the wide applicability of WLSE and related 
estimators. Their efficiency is mainly based on the fact that they are obtained 
by minimization of certain quadratic functionals. But there are quite a number 
of further estimation methods for functional relations which use principles 
other than the minimization of a quadratic functional. These methods were 
investigated mainly in the 1950s. One of the most popular estimation methods 
for bivariate LIFU* was the grouping method, where the straight line 
runs through the means of two groups of observations. A work by Wald (1940) 
made clear the possibility of the consistent estimation under certain assump- 
tions to the realizations m,,) if all observations are used. Compared to this, 
Bartlett (1949) omitted a middle group of observations. Investigations of 
the corresponding method in LIFU- followed (see below). Asymptotic variances 
for such estimations can be found in Dorff and Gurland (1961b). Furthermore, 
they compared their properties with the OLSE for small sample size on the 
basis of approximations of the bias and of the mean square error. These appro- 
ximations contained results of Brennan and Housner (1948). 
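
The grouping idea described above can be illustrated by a small Python sketch. It is only an illustration under stated assumptions: the function names, the simulated data and the rule that the observations are already ordered independently of the errors are hypothetical choices of this example, not taken from the cited works.

```python
import numpy as np

def grouping_slope(x, y, lower, upper):
    """Slope of the straight line through the means of two groups of observations."""
    return (y[upper].mean() - y[lower].mean()) / (x[upper].mean() - x[lower].mean())

def two_groups(n):
    """Index sets using all observations: lower half and upper half."""
    half = n // 2
    idx = np.arange(n)
    return idx[:half], idx[n - half:]

def outer_thirds(n):
    """Index sets omitting a middle group: lower third and upper third."""
    third = n // 3
    idx = np.arange(n)
    return idx[:third], idx[n - third:]

# Hypothetical data: true line eta = 1 + 2 * xi, both coordinates observed with error.
rng = np.random.default_rng(3)
xi = np.linspace(0.0, 10.0, 60)          # design, ordered independently of the errors
x = xi + 0.5 * rng.normal(size=xi.size)
y = 1.0 + 2.0 * xi + 0.5 * rng.normal(size=xi.size)

slope_halves = grouping_slope(x, y, *two_groups(x.size))
slope_thirds = grouping_slope(x, y, *outer_thirds(x.size))
print(slope_halves, slope_thirds)
```

Both variants only need group means, which explains their popularity before computers made WLSE routinely available.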

However, such grouping estimations can be understood as special 
instrumental variables estimators (IVE). Such estimators were first studied by 
Reiersol (1945), namely for linear simultaneous equations. Richardson and Wu 
(1970) compared OLSE and other IVE on the basis of the variance. Sargan (1958, 
1959) showed the relation between the two-stage LSE, which originates from 
the two-stage estimation in simultaneous equations, and the IVE. Gurian 
and Halperin (1971) provided the exact expressions for the variances for more 
general cases by means of applying hypergeometric functions. A combination 
of IVE and OLSE was investigated by Feldstein (1974). In this way the greater 
losses of efficiency that are characteristic of IVE and the bias of the OLSE 
compensate each other; approximations and Monte Carlo simulations are used 
to show this. Investigations of efficiency for ranks as 
instrumental variables, a method that goes back to Theil (1950a, b), are to 
be found in Ware (1972). These results occur again in the general approach to 
the asymptotic optimality of IVE in Nussbaum (1978). Further methods for 
LIFU- are obtained in the case of replications, but also with groupings and 
other instrumental variables by using variance components according to 
Tukey (1951). There is a good representation in Madansky (1959, pp. 189-194). 

Another method for LIFU*, which also allows estimations in the nonnormal 
case, results when using cumulants (Geary, 1942, 1943). Principally this method 
can also be applied in LIFU- and for nonlinear models, but it has only been 
of theoretical interest till now. Kendall and Stuart (1961) offer a good summary. 
Newer methods of estimation in LIFU* mainly aim at reducing the variance 
of the estimators, if necessary at the expense of the unbiasedness. To this 
group we can assign the works by Lord (1960) and DeGracie and Fuller (1972). 

The difficulties in identifying and handling the distribution of observations 
are clearly reflected in the works on estimations with LIFU-. At the beginning, 
bivariate LIFU- were considered, and MLE of the different parameters, with 
additional information about the error covariance, were repeatedly published 
by various authors independently of each other (e.g. Kummel, 1879; Pearson, 
1901; Lindley, 1947) (cf. Madansky, 1959). Geary (1949) was the first to show 
the identifiability in the nonnormal case by means of his method of cumulants 
(cf. Scott, 1950; Drion, 1951). Before this, the nonidentifiability in special 
LIFU- had already been proved by Thomson (1919), Gini (1921), Frisch (1934), 
Neyman (1937), Koopmans (1932). Reiersol (1950a) gave the necessary and 
sufficient conditions. A detailed analysis of the multivariate case was worked 
out by Rao (1966); see also Jeeves (1954). 

The grouping method by Wald (1940) for LIFU* was modified for LIFU-, 
and comparisons of efficiency were carried out for special error distributions: 
Nair and Srivastava (1942), Bartlett (1949), Theil and van Yzeren (1956), Nair 
and Banerjee (1943), Gibson and Jowett (1957). Dorff and Gurland (1961a, b) 
calculated the estimator variances for small and large samples. The identifia- 
bility condition for this method was given by Neyman and Scott (1951). 

Other estimators constructed by Neyman (1951) and Wolfowitz (1952) 
remained of only theoretical interest because they are difficult to calculate. 
Spiegelman (1979) offered an estimator that can be calculated more easily. 

Some discussions were evoked by the situation of the ‘overidentification’ 


246 Chapter 3. Models with errors-in-variables 


in the normal case with known error variances, which arose by wrong appli- 
cation of the MLE, (cf. Kendall and Stuart, 1961, 29.11). Kiefer (1964), Barnett 
(1967), and Birch (1964) showed that the correct application of the MLE pro- 
vides correct results. 

The application of instrumental variables was investigated by Reiersol 
(1945). Geary (1949) studied the efficiency. An example from biology is con- 
sidered in Carlson, Sabel, and Watson (1966). Lyttkens (1977) gives a survey. 
A special multivariate LIFU- is investigated in Barnett (1969), where 
the starting point was the calibration of measuring instruments in medi- 
cine. 

Estimations with repeated measurements by means of sample covariances 
are contained in Tukey (1951); Madansky (1959) gives a good survey. 
Asymptotic variances are given in Dorff and Gurland (1961a) and a more 
complicated estimator was constructed by Housner and Brennan (1948). 

For LIFU with dependent observations, Robinson (1977) constructed a 
contrast function, which provides consistent estimators. The related 
minimization problem is similar to WLSE problems. With a new approach, which 
uses martingale theory, consistency and asymptotic normality are shown, the 
corresponding variances are derived and algorithms are given. This model comprises 
LIFU* as well as LIFU-. 

Problems of Bayes’ inference were investigated by El Sayad and Lindley 
(1968), Zellner (1971), Villegas (1972), Florens, Mouchart, and Richard (1974), 
where the basis was certain noninformative prior distributions. Some tests were 
constructed by Anderson (1951a) in his fundamental work on LIFU* with 
linear regression part (cf. Section 3.1.3). Independently, Williams (1955), 
Bartlett (1957), Basu (1969), and Moran (1956) considered some special prob- 
lems, e.g. for known error covariance. Confidence intervals may also be found 
in Anderson (1951a). Later on, special cases were considered once again in 
Creasy (1956), Brown and Fereday (1958), Halperin (1964) and Villegas (1964), 
among them also some for known error covariances. For LIFU- according 
to (7), asymptotic tests and confidence intervals were developed in Cox 
(1976). 

Finally we want to mention some results on special models that are closely 
connected to functional relations. Controlled variables were considered in 
Berkson (1950) for LIFU and in Fedorov (1974) for nonlinear models. Lindley 
(1947) studied the problem of the relation between the conditional 
expectation of $y$ given $x$ and the corresponding LIFU for $\xi$ and $\eta$. There 
resulted conditions under which a LIFU between the $\xi_i$ and $\eta_i$ provides a 
linear regression of $y_i$ over $x_i$. 

Mainly starting from econometric models, models with dependent errors 
and time-series models are considered. Concerning this, there are results in 
Robinson (1974), and for time series in Nowak (1975, 1976, 1977), Aigner (1966). 
The problems connected with such models are only partly touched upon in the 
present chapter. 

3.2 Maximum likelihood estimators 


The survey about the literature illustrated the central role of the MLE and the 
associated WLSE. The following section provides a survey. First, in Section 
3.2.1, bivariate linear functional relations are treated in detail, where in the 
initial model the experimental design is normally distributed. From this model 
we obtain on a unified basis a number of interesting special models and the 
corresponding estimation formulas, among them those for models that are 
known as ‘structural’ and functional relations in the literature. For such simple 
models there result estimation formulas that can still be handled on a small 
computer. This does not hold for multivariate models with random design so 
that in the multivariate case only models with nonrandom experimental 
design will be considered. 

In Section 3.2.2 we describe the relation between MLE and WLSE for 
general errors-in-variables models with nonrandom experimental design. 
Based on this, the MLE is comprehensively derived for multivariate LIFU* 
with known error covariance in Section 3.2.3. The equivariance is also shown. 
In Sections 3.2.4 and 3.2.5 two important LIFU* with unknown covariance 
are investigated. In these cases we still can obtain the MLE in a relatively 
closed form, as the solution of an eigenvalue problem. 

For general models we can no longer give closed formulas. Sections 3.2.6 
and 3.2.7 describe some possibilities for simplifying numerical calculations in 
these special models. Finally, we briefly mention identifiability properties of 
estimators. 


3.2.1 Bivariate linear functional relations 
3.2.1.1 The general model 
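We consider the LIFU 

\[
\begin{aligned}
\eta_{ij} &= \alpha + \beta\xi_{ij}, \qquad i = 1, \dots, n,\\
x_{ij} &= \xi_{ij} + \delta_{ij}, \qquad j = 1, \dots, m_i,\\
y_{ij} &= \eta_{ij} + \varepsilon_{ij},\\
\xi_{ij} &\sim N(\mu_i, \sigma_\xi), \qquad \mu_i \in \mathbb{R}^1,\ \sigma_\xi \ge 0,\\
\delta_{ij} &\sim N(0, \sigma_\delta), \qquad \sigma_\delta \ge 0,\\
\varepsilon_{ij} &\sim N(0, \sigma_\varepsilon), \qquad \sigma_\varepsilon \ge 0,
\end{aligned}
\tag{1}
\]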

which was described in Section 3.1, where the random variables are all sup- 
posed to be independent of each other. Inserting known parameters yields 




various important models as special cases. A model not included here, which 
is nevertheless important, can be found at the end of Section 3.2.1.2. 

The uniform treatment of all models is based on the following principle. 
We insert the known parameter values or the corresponding functions of them 
into the likelihood function for the general model, e.g. $\sigma_\xi = 0$ or 
$\sigma_\xi = \varrho_1\sigma_\delta$ in case $\varrho_1$ is known. Then we obtain the MLE with the 
known parameter values as the maximum of the thus simplified likelihood 
function. The same holds for the related likelihood equations, where the roots 
of the gradient of the likelihood function are determined. These roots are the 
stationary points of the likelihood function, one of which is the MLE. The 
equations for the specialized models also result from the equations of the 
general one by inserting the known parameter values. 

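Now we describe the likelihood function. With fixed $\beta \in \mathbb{R}^1$ and 
$\gamma := [\sigma_\xi, \sigma_\delta, \sigma_\varepsilon]' \ge 0$, $z_{ij} := [x_{ij}, y_{ij}]'$ has the covariance 

\[
D(z_{ij}) = \begin{pmatrix} \sigma_\xi + \sigma_\delta & \beta\sigma_\xi \\ \beta\sigma_\xi & \beta^2\sigma_\xi + \sigma_\varepsilon \end{pmatrix} =: \Sigma(\beta, \gamma). \tag{2}
\]

Further, let 

\[ m_i := [\mu_i, \alpha + \beta\mu_i]', \tag{3} \]

and for $\Sigma \in M_2^{>}$ let 

\[
l(\alpha, \beta, \mu_{(n)}, \Sigma) := -(m/2) \log\det[\Sigma] - \sum_i \sum_j (z_{ij} - m_i)' \Sigma^{-1} (z_{ij} - m_i)/2. \tag{4}
\]

The log-likelihood function is then 

\[ l_\vartheta := l(\alpha, \beta, \mu_{(n)}, \Sigma(\beta, \gamma)) \tag{5} \]

with the parameter 

\[ \vartheta := [\alpha, \beta, \mu_{(n)}, \gamma] \in \Theta \subset \mathbb{R}^{n+5}, \tag{6} \]

where $\Theta$ is the domain in $\mathbb{R}^{n+5}$ for which $\gamma$ lies in 

\[ (\mathbb{R}^{\ge})^3 = \Gamma := \{\gamma \mid \sigma_\xi \ge 0,\ \sigma_\delta \ge 0,\ \sigma_\varepsilon \ge 0\} \tag{7} \]

and $\Sigma(\beta, \gamma)$ is positive definite. 
More suitable for various purposes is an equivalent representation of $l_\vartheta$: 

\[ l_\vartheta = -(m/2) \log\det[\Sigma] - (1/2)\,\mathrm{tr}\,\Sigma^{-1}S, \tag{8} \]

\[ S := \sum_i \sum_j (z_{ij} - m_i)(z_{ij} - m_i)'. \tag{9} \]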
A further representation results from splitting $S$ into a summand which is 
independent of $\vartheta$, and one that depends on the observations $z_{ij}$ only 
through the means $\bar z_{i\cdot}$: 

\[ S = W_z + S(\mu_{(n)}), \tag{10} \]

\[ W_z := \sum_i \sum_j (z_{ij} - \bar z_{i\cdot})(z_{ij} - \bar z_{i\cdot})', \tag{11} \]

\[ S(\mu_{(n)}) := \sum_i m_i (\bar z_{i\cdot} - m_i)(\bar z_{i\cdot} - m_i)'. \tag{12} \]
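
The agreement of the representations (4) and (8) and the splitting (10) can be checked numerically. The following Python sketch is an illustration only; the function names, the simulated data and the use of NumPy are assumptions of this example and do not come from the text.

```python
import numpy as np

def mean_vector(alpha, beta, mu_i):
    """m_i = [mu_i, alpha + beta * mu_i]' as in (3)."""
    return np.array([mu_i, alpha + beta * mu_i])

def loglik_direct(groups, alpha, beta, mu, Sigma):
    """Evaluate (4); groups[i] is an (m_i, 2) array whose rows are z_ij."""
    m = sum(len(g) for g in groups)
    Sigma_inv = np.linalg.inv(Sigma)
    quad = 0.0
    for g, mu_i in zip(groups, mu):
        r = g - mean_vector(alpha, beta, mu_i)     # rows z_ij - m_i
        quad += np.sum((r @ Sigma_inv) * r)        # sum over j of r_j' Sigma^{-1} r_j
    return -0.5 * m * np.log(np.linalg.det(Sigma)) - 0.5 * quad

def scatter_split(groups, alpha, beta, mu):
    """Return W_z as in (11) and S(mu_(n)) as in (12)."""
    W = np.zeros((2, 2))
    S_mu = np.zeros((2, 2))
    for g, mu_i in zip(groups, mu):
        zbar = g.mean(axis=0)
        c = g - zbar
        W += c.T @ c
        d = zbar - mean_vector(alpha, beta, mu_i)
        S_mu += len(g) * np.outer(d, d)
    return W, S_mu

def loglik_trace(groups, alpha, beta, mu, Sigma):
    """Evaluate (8) with S = W_z + S(mu_(n)) as in (10)."""
    m = sum(len(g) for g in groups)
    W, S_mu = scatter_split(groups, alpha, beta, mu)
    trace_term = np.trace(np.linalg.solve(Sigma, W + S_mu))
    return -0.5 * m * np.log(np.linalg.det(Sigma)) - 0.5 * trace_term

# Hypothetical data: n = 4 design points with 3, 2, 5 and 1 replications.
rng = np.random.default_rng(0)
alpha, beta, mu = 1.0, 2.0, np.array([0.0, 1.0, 2.0, 3.0])
groups = [rng.normal(size=(k, 2)) + mean_vector(alpha, beta, mu_i)
          for k, mu_i in zip([3, 2, 5, 1], mu)]
Sigma = np.array([[1.5, 0.4], [0.4, 2.0]])
print(loglik_direct(groups, alpha, beta, mu, Sigma),
      loglik_trace(groups, alpha, beta, mu, Sigma))
```

The two printed values should coincide up to rounding, since only the arrangement of the quadratic form differs.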


3.2.1.2 Replicated observations 


Now let at least one of the $m_i$ be greater than one. Thus there exists at least 
one repeated observation. Then the likelihood function is bounded. To show 
this, we notice first 

\[
\max_{\vartheta} l_\vartheta \le \max_{\Sigma \in M_2^{>}} \max_{\alpha, \beta, \mu_{(n)}} \{-(m/2) \log\det[\Sigma] - (1/2)\,\mathrm{tr}[\Sigma^{-1}S]\}. \tag{13}
\]

All summands of $S$ are at least positive semidefinite. Let $m_{i_0} > 1$; then 

\[
l_\vartheta \le \max_{\Sigma} \{-(m/2) \log\det[\Sigma] - (1/2)\,\mathrm{tr}[\Sigma^{-1}S_{i_0}]\} \tag{14}
\]

with 

\[
S_{i_0} := \sum_{j=1}^{m_{i_0}} (z_{i_0 j} - \bar z_{i_0\cdot})(z_{i_0 j} - \bar z_{i_0\cdot})'. \tag{15}
\]

But, since the $z_{ij}$ are normally distributed, $S_{i_0}$ is almost surely positive 
definite, as is obvious from [A 3.14]. Hence the right-hand side is bounded and 
attains its maximum for (cf. [A 3.15]) 

\[ \hat\Sigma = S_{i_0}/m. \tag{16} \]
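
The maximization step behind (16) can be illustrated numerically. The Python sketch below is a hedged illustration: the matrix `A` stands for any positive definite 2x2 scatter matrix, and all names and the random perturbation scheme are assumptions of this example.

```python
import numpy as np

def profile(Sigma, A, m):
    """-(m/2) log det[Sigma] - (1/2) tr[Sigma^{-1} A], the bracket in (14)."""
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    return -0.5 * m * logdet - 0.5 * np.trace(np.linalg.solve(Sigma, A))

rng = np.random.default_rng(1)
m = 10                                    # hypothetical total number of observations
R = rng.normal(size=(6, 2))
A = R.T @ R                               # positive definite 2x2 scatter matrix
Sigma_hat = A / m                         # candidate maximizer, cf. (16)
best = profile(Sigma_hat, A, m)

# Compare with randomly perturbed positive definite matrices.
others = []
for _ in range(200):
    E = 0.3 * rng.normal(size=(2, 2))
    P = (1.0 + 0.1 * (rng.random() - 0.5)) * Sigma_hat + E @ E.T
    others.append(profile(P, A, m))
print(best, max(others))
```

At the maximizer the trace term equals $2m$, so the maximal value is $-(m/2)\log\det[A/m] - m$; the constant $-m$ reappears in the likelihood values given later for the boundary cases.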


Thus the likelihood function $l_\vartheta$ is almost surely bounded. As we will see, 
the MLE also exists almost surely and can be obtained in several steps. First 
the parameter region is extended and the maximum of $l$ determined there: 

\[
\max_{\alpha, \beta, \mu_{(n)}} \max_{\Sigma \in M_2^{>}} l(\alpha, \beta, \mu_{(n)}, \Sigma). \tag{17}
\]

Let $\hat\alpha, \hat\beta, \hat\mu_{(n)}, \hat\Sigma$ be the solution. Let $\hat\beta$ be fixed. Then $\hat\Sigma$ uniquely 
determines a $\gamma(\hat\Sigma)$ by the relation 

\[ \hat\Sigma = \Sigma(\hat\beta, \gamma), \tag{18} \]

which, written out with (2), is a system of linear equations for the unknowns 
$\sigma_\xi, \sigma_\delta, \sigma_\varepsilon$ in terms of the elements $\hat\Sigma_{11}, \hat\Sigma_{12}, \hat\Sigma_{22}$ of $\hat\Sigma$: 

\[
\sigma_\xi + \sigma_\delta = \hat\Sigma_{11}, \qquad \hat\beta\sigma_\xi = \hat\Sigma_{12}, \qquad \hat\beta^2\sigma_\xi + \sigma_\varepsilon = \hat\Sigma_{22}.
\]

If $\gamma(\hat\Sigma) \in (\mathbb{R}^{\ge})^3$, then we have a solution of the initial problem. It will be 
shown that the maximization problem with the extended parameter region has 
almost surely a uniquely determined solution $\hat\beta = \beta_1$, and moreover only one 
further stationary point $\beta_2$. This solution was given by Cox (1976), who 
omitted the elementary but complicated calculations to solve the likelihood 
equations. We will obtain Cox’s solution from Theorem 3.2.9 of (Anderson, 
1951), which contains the multivariate form of the problem. Hence $\hat\beta = \beta_1$ is 
determined as a solution of the eigenvalue problem 


(19) 


\[ (B_z - \lambda_1 W_z)[\hat\beta, -1]' = 0, \qquad 0 < \lambda_1 < \lambda_2. \tag{20} \]


Equivalently we have 

\[ [1, \beta]\, W_z^{-1} B_z\, [\beta, -1]' = 0. \tag{21} \]

Then we can use 

\[
W_z^{-1} = \det\{W_z\}^{-1} \begin{pmatrix} w_y & -w_{xy} \\ -w_{xy} & w_x \end{pmatrix}. \tag{22}
\]


We obtain the quadratic equation given by Cox for $\beta$, where on account of the 
condition $\lambda_1 < \lambda_2$, resulting from Theorem 3.2.8, the smaller one of the 
two solutions has to be chosen. This was pointed out by Theobald (cf. Cox 
and Dolby, 1977). 
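
The link between the eigenvalue problem (20) and the quadratic equation following from (21) and (22) can be sketched in Python. This is an illustration under assumptions: $B_z$ is not defined in this passage, so the sketch takes it to be the usual between-group scatter matrix of the means $\bar z_{i\cdot}$ about the overall mean, and all function names and data are hypothetical.

```python
import numpy as np

def within_between(groups):
    """W_z as in (11) and an assumed between-group scatter matrix B_z."""
    grand = np.vstack(groups).mean(axis=0)
    W = np.zeros((2, 2))
    B = np.zeros((2, 2))
    for g in groups:
        zbar = g.mean(axis=0)
        c = g - zbar
        W += c.T @ c
        d = zbar - grand
        B += len(g) * np.outer(d, d)
    return W, B

def beta_from_eigenproblem(W, B):
    """Solve (B - lam W) a = 0 and read beta off a ~ [beta, -1]' for the smallest lam."""
    L = np.linalg.cholesky(W)                       # W = L L'
    L_inv = np.linalg.inv(L)
    lam, V = np.linalg.eigh(L_inv @ B @ L_inv.T)    # eigenvalues in ascending order
    a = L_inv.T @ V[:, 0]                           # generalized eigenvector for lam[0]
    return -a[0] / a[1], lam

def quadratic_roots(W, B):
    """Roots in beta of [1, beta] adj(W) B [beta, -1]' = 0, cf. (21) and (22)."""
    w_x, w_xy, w_y = W[0, 0], W[0, 1], W[1, 1]
    b_x, b_xy, b_y = B[0, 0], B[0, 1], B[1, 1]
    return np.roots([w_x * b_xy - w_xy * b_x,
                     w_y * b_x - w_x * b_y,
                     w_xy * b_y - w_y * b_xy])

# Hypothetical data: true line eta = 1 + 2 * xi, six design points, four replications.
rng = np.random.default_rng(2)
mu = np.linspace(0.0, 5.0, 6)
groups = [np.column_stack([mu_i + 0.3 * rng.normal(size=4),
                           1.0 + 2.0 * mu_i + 0.3 * rng.normal(size=4)])
          for mu_i in mu]
W, B = within_between(groups)
beta_hat, lam = beta_from_eigenproblem(W, B)
print(beta_hat, lam, quadratic_roots(W, B))
```

Both eigenvectors of the pencil satisfy (21), so each root of the quadratic belongs to one eigenvalue; the sketch selects the root belonging to the smaller eigenvalue.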

In case the $\gamma(\hat\Sigma)$ defined from this solution is not in the admissible 
parameter region, the maximum is located on the boundary. This follows from the 
above-mentioned property of the extended likelihood function that there 
exist only two stationary points of $l$, one of which is the global maximum and 
the other gives the global minimum. If the maximum for the restricted 
likelihood did not lie on the boundary of $(\mathbb{R}^{\ge})^3$ then there would have to 
exist still another local maximum of $l_\vartheta$ with $\gamma$ in the interior of $(\mathbb{R}^{\ge})^3$, 
the $l_\vartheta$-value of which exceeds all $l_\vartheta$-values on the boundary. The 
corresponding value of $\beta$ would give a third stationary point of the extended 
likelihood, which is impossible. 

Comparing the maxima over the three boundary components of $\Gamma$ provides 
the desired MLE. In a model with equal numbers of replications, Dolby (1976a) 
got the corresponding equations, but he did not investigate the MLE on the 
boundary. The boundary maxima of the likelihood function for $\sigma_\delta = 0$ and 
$\sigma_\varepsilon = 0$, respectively, are simply obtained from weighted LSE in the 
corresponding bivariate linear regression models of $y$ over $x$, and $x$ over $y$ 
respectively. The boundary MLE for $\sigma_\xi = 0$ finally requires the solution of an 
equation of the fourth degree in $\beta$ (cf. Cox, 1976). The MLE for LIFU* with 
replications are collected in Table 3.2.1, where we have the following notation: 


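\[
\begin{aligned}
B(\beta) &= b_y - 2\beta b_{xy} + \beta^2 b_x,\\
W(\beta) &= w_y - 2\beta w_{xy} + \beta^2 w_x,\\
T(\beta) &= B(\beta) + W(\beta),
\end{aligned}
\tag{23}
\]

\[
\hat\beta = \frac{w_y b_x - w_x b_y - \left((w_x b_y - w_y b_x)^2 - 4(w_{xy} b_x - w_x b_{xy})(w_y b_{xy} - w_{xy} b_y)\right)^{1/2}}{2(w_{xy} b_x - w_x b_{xy})}, \tag{24}
\]

\[ p = (\beta w_x - w_{xy})/W(\beta), \tag{25} \]

\[ q = (w_y - \beta w_{xy})/W(\beta). \tag{26} \]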
Table 3.2.1. MLE for LIFU*+ with replications 
para- inner MLE o, = 90 6, =0 o: =0 
meter 
B B BxylOz by|Oxy B 
8 WG. —9..+ BB.) +9%), Fi, %,.+(beylb2) ¥i.—9..) 9; 
Oé (Way —pqB(B) \/B Wy wybiy/b5 0 
05 pT (B)/B 0 BE Org Oy (Wyt(8b,—B gy?)/B( ) 
O; qT(B) b= 02 ,/Oe 0 Wy +(by—Bbxy)/B(B) 
For o ;= 0 we obtain f as the solution of the equation 
—y W, (Bb, Tc) bry) | BiB) = (by rT Bb,,) (Bbzy) (by icc pb,). (27) 
In this case, 
1. = —m(1 + log 2x) —— | Wy + Bho aa 
ie E ‘ BO) 


(b, ae, Boy)? 28 
«fon + Bp). }) oe 
For $\sigma_\delta = 0$ and $\sigma_\varepsilon = 0$, we have 


ue 

Ll; = —m(1 + log 2x) — ce log (». ( y— ald (29) 
2m b, 
1 b2 

1, = —m(1 + log 2x) — — log |w, |b, — —*)). (30) 
2m by 


Finally, let 


Hence, if the MLS is negative for one of the variances $\sigma$, we have to choose 
that boundary MLE for which the corresponding likelihood value is maximal. 
From these equations we obtain on a unified basis, by further restrictions on 
the parameters, MLE for special LIFU and statements connected with this 
that have been derived in different ways before. 


1. $\sigma_\xi = 0$ provides the LIFU* with repeated observations of an experimental 
design. Here one obtains 


é; = S3/m, a, = S;/m. (32) 


Barnett (1970) obtained an implicit system of equations without giving 
explicit formulae. 

2. For $\mu_i = \mu$ one obtains a LIFU- model known as a ‘structural relation’. 
One obtains (cf. Dolby, 1976a, p. 43): 

\[
\hat\mu = \bar x_{\cdot\cdot}, \qquad \hat\alpha + \hat\beta\hat\mu = \bar y_{\cdot\cdot}. \tag{33}
\]

(In the classical approach (Kendall and Stuart, 1961, p. 379) this equation 
is obtained by a heuristically founded ‘sufficiency argument’, which fails 
for known $\sigma_\delta$, $\sigma_\varepsilon$; see Remark 3.2.2.) 


Chan and Mak (1979b) treat another interesting model with a different 
structure of replications: 

\[
\begin{aligned}
\eta_i &= \alpha + \beta\xi_i, & \xi_i &\sim N(\mu, \sigma_\xi),\\
x_{ij} &= \xi_i + \delta_{ij}, & \delta_{ij} &\sim N(0, \sigma_\delta),\\
y_{ij} &= \eta_i + \varepsilon_{ij}, & \varepsilon_{ij} &\sim N(0, \sigma_\varepsilon),
\end{aligned}
\]

where $j = 1, \dots, m_i$ and the random variables $\xi_i$, $\delta_{ij}$, $\varepsilon_{ij}$ are independent. 
But take into account that this model cannot be obtained as a special case of 
the previous one. This is immediately obvious if we calculate the covariance 
$D(x_{ij}, x_{ik})$ for $k \ne j$ in both models. In the starting model we have 0, whereas 
$\sigma_\xi$ occurs in the one mentioned last. The MLE satisfy the equations (33) that 
were already obtained in the above model for the case of $\mu_i = \mu$. The MLE 
for $\beta$ is one of the roots of a polynomial of fourth degree. The boundary MLE 
are not studied in Chan and Mak (1979b). 


3.2.1.3 Observations without replications 


If there are no replications, then $m_i = 1$, $i = 1, \dots, n$, $z_{i1} =: z_i$. In this case 
the solutions of the likelihood equations were discussed by Dolby (1976a), where 
the details of the following discussion are to be found. Without further 
additional assumptions it results from the likelihood equations that 


0 = (6, + 6°65)/det [Ay]. (34) 


This equation has no solution for real $\beta$ (see Table 3.2.2). If the variance ratio 
$\sigma_\xi/\sigma_\delta =: \varrho_1$ is assumed to be known, an equation of the form $0 = g(\beta, \sigma_\delta, \sigma_\varepsilon)$ 
follows which in general cannot be satisfied by consistent estimators $\hat\beta, \hat\sigma_\delta, \hat\sigma_\varepsilon$. 
From this, for LIFU*, i.e. for $\varrho_1 = 0$, the equation $\beta^2\sigma_\delta = \sigma_\varepsilon$ results, which was 
derived already by Lindley (1947). The unboundedness of the likelihood func- 


Table 3.2.2. MLE for bivariate normal LIFUt 


Ej, ~~ is Or)s 03; ~~ N(0, 5), &;; ~~ N(0, 6), 01 = 6:/05, 02 = G,/c5 


with replications without 
unequal Table 3.2.1 without additional “1? MLE* 
number of information 
replications 0; OF g, known — MLE 
Q, and e, known (35), (37) 
0, — 0,0, +0 (38) 
Cio; 
(33) O,, 63 known (39) 
01 known (40) 
LIFUt (o; = 0, = 0) 
equal (32) without additional “7 MLE 
number of information 
F plications o, or o, known “7 MLE 
0, known (37) 
o; and o, known, B from (39) 


Spee ee eee) ess oe eer 


* — MLE: there exists no global maximum of the likelihood function 
tion for these LIFU* (Anderson and Rubin, 1956, pp. 129-130) corresponds to 
the inconsistency of MLE shown by Lindley. In fact, the likelihood equations 
determine only a saddle-point (Solari, 1969). 

It is also insufficient to know the ratio $\varrho_2 = \sigma_\varepsilon/\sigma_\delta$ since in this case equation 
(34) follows in the same way. But if both ratios are known, the following 
likelihood equations are obtained. They are also valid for LIFU*, i.e. in the 
case $\varrho_1 = 0$ and when all variances are known. We denote 

\[
\begin{aligned}
\tau &= 1/(\sigma_\varepsilon + \beta^2\sigma_\delta),\\
g(\beta) &= d_{xy}\beta^2 + (\varrho_2 d_x - d_y)\beta - \varrho_2 d_{xy},\\
e &= y - \alpha 1_n - \beta x.
\end{aligned}
\tag{35}
\]
Then it holds that 


68=a-+ Bé,te, (36) 
O = osBlle.|!? (02 + B?) + 2n(01(e2 + 82) + @2) 9(B)- 


Under $\rho_1 = 0$, for $\beta$ one obtains the equation $0 = g(\beta)$ which has been known
for a long time. Madansky (1959) showed that the greater of the two solutions
provides the MLE:


$\hat\beta = ((d_y - \rho_2 d_x) + ((d_y - \rho_2 d_x)^2 + 4\rho_2 d_{xy}^2)^{1/2}) / (2d_{xy})$. (37)
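
As a small numerical illustration of formula (37), the following sketch evaluates the estimate for given sample moments and checks that it is a root of $g$ from (35). The function names and the numbers are hypothetical and chosen only for illustration; $d_x$, $d_y$, $d_{xy}$ are assumed to be the usual centred sums of squares and products.

```python
import math

def g(beta, d_x, d_y, d_xy, rho2):
    # Quadratic from (35): d_xy*beta^2 + (rho2*d_x - d_y)*beta - rho2*d_xy
    return d_xy * beta**2 + (rho2 * d_x - d_y) * beta - rho2 * d_xy

def beta_known_ratio(d_x, d_y, d_xy, rho2):
    # Root of g taken with the positive square root, as in (37).
    if d_xy == 0:
        raise ValueError("d_xy must be nonzero")
    diff = d_y - rho2 * d_x
    return (diff + math.sqrt(diff**2 + 4.0 * rho2 * d_xy**2)) / (2.0 * d_xy)

# Hypothetical sample moments (illustrative numbers only).
d_x, d_y, d_xy, rho2 = 2.0, 5.0, 3.0, 1.0
b = beta_known_ratio(d_x, d_y, d_xy, rho2)
residual = g(b, d_x, d_y, d_xy, rho2)  # should vanish up to rounding
```

With these numbers $g$ reduces to $3(\beta^2 - \beta - 1)$, so the selected root is $(1 + \sqrt{5})/2$.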


In case of only one known error variance under LIFU* the likelihood function 
is unbounded, except for the regression cases o, = 0 or os = 0 (Moberg and 
Sundberg, 1978). For e. = 0, e, += 0 — which might be considered as a ‘quasi- 
regression model’ — one obtains (cf. Dolby 1976a): 


$\hat\beta = \operatorname{sign}(d_{xy})\,(d_y/d_x)^{1/2}$. (38)


This is the geometric mean of both regression estimates of x over y and y 
over x, respectively, which was heuristically derived by Teissier (1948). 
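
The geometric-mean property can be checked directly. The sketch below is only an illustration with hypothetical numbers: it forms the two regression slopes (both expressed as slopes of $y$ against $x$) and their geometric mean, carrying the sign of $d_{xy}$.

```python
import math

def geometric_mean_slope(d_x, d_y, d_xy):
    # Slope from regressing y on x, and slope implied by regressing x on y.
    b_y_on_x = d_xy / d_x
    b_x_on_y = d_y / d_xy
    # The product of the two slopes is d_y/d_x > 0, so the root is real.
    gm = math.copysign(math.sqrt(b_y_on_x * b_x_on_y), d_xy)
    return b_y_on_x, b_x_on_y, gm

# Hypothetical sample moments (illustrative numbers only).
b_yx, b_xy, gm = geometric_mean_slope(2.0, 5.0, 3.0)
```

For positive $d_{xy}$ the geometric mean always lies between the two regression slopes, since $d_{xy}^2 \le d_x d_y$.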


Case 0; == 0.4. = 1S. et 


If both error variances are known and #; = #, then, according to Barnett 
(1967) one obtains the MLE from the equations 


$\hat\beta = ((d_y - \rho_2 d_x) + ((d_y - \rho_2 d_x)^2 + 4\rho_2 d_{xy}^2)^{1/2}) / (2d_{xy})$,


=%, &=J, —6z,, (39) 


> 


6: = (01d, + 20,04, + B2dy) — 0,(6205 + o,)/(0, + B2)?. 
But Barnett took no notice of the possibility $\hat\sigma_\xi^2 < 0$. In this case the
MLE lies on the boundary ‘$\sigma_\xi = 0$’. This corresponds to $\rho_1 = 0$ and the MLE
is obtained as in the model in which the $\xi_i$ can be different. But, finally, the
same formula (37) results for $\hat\beta$. With regard to this model there were
considerable discussions on the ‘overidentified’ likelihood equations (cf. Kendall and
Stuart, 1961, 29.9-11). They argued that the MLE could be obtained by
‘solving those equations which equate sample values with the theoretical
expectations’. This would be based on the sufficiency of sample values for
parameters. But, in the case that both error variances are known, five equations
are obtained for only four parameters, and thus there is an ‘overidentifiability’.
Kiefer (1964) pointed out that the correct application of the likelihood
principle does not cause overidentifiability.

Birch (1964) also started from overidentified equations. He applied the
likelihood principle correctly only in those cases where these equations did not have a
solution. In some of his complicated derivations he found the same solution
as Barnett (1967) but, apparently, he did not correctly treat the case $\hat\sigma_\xi = 0$
(cf. Birch, 1964, situation (vi), p. 1176).

If 9, is known, one obtains the same formula for f. The solution for 6; is: 


(cf. Madansky, 1959). 

The formula for $\hat\beta$ remains valid for LIFU*, i.e. for $\rho_1 = 0$. If all variances are
unknown and $\alpha$ is known, the MLE was derived by Chan and Mak (1979a).
Similar to the model with replications, here the MLE is obtained by
investigating the stationary points and the maxima on the boundaries of the parameter
region, $\sigma_\xi = 0$, $\sigma_\delta = 0$, and $\sigma_\varepsilon = 0$.


3.2.2 Maximum likelihood and least squares estimators
3.2.2.1 Estimation procedures for models with errors-in-variables


In Section 3.2.1 the MLE were investigated for the simplest bivariate linear
models. In important special cases explicit formulae were obtained to estimate
the structural parameter. For more general models this cannot be expected.
In this section the two most important estimation procedures for models with
errors-in-variables are described. Both result from the minimization
of an estimation functional over the parameter space:


$\hat\gamma(z) = \arg\min_{\gamma \in \Gamma} l_z(\gamma)$. (41)


For more general models with errors-in-variables we had y = [mn), 2, y] 
€ S, < I’. For explicit models we have y = [&(n), %, 7]. 
3.2.2.2 Maximum likelihood estimation


Let the densities (or their logarithm) of the distribution of $:m) be given by 
i(-), y € I’. Then the MLE for y is obtained from the estimation functional 


L(y) = —E(z — yw) (42) 


Under normal distribution with unknown covariance $\Omega = \Omega(\gamma)$, $\gamma \in \Gamma$, we
have


$l_z(\gamma) = \log\det[\Omega]/2 + \|z - \mu\|^2_{\Omega^{-1}}/2$. (43)
The MLE result from 


}? =arg min minl,(y). (44) 


[4,.2]eS, ver 


3.2.2.3 Least squares estimation 


With the Euclidean norm $\|\cdot\| = \|\cdot\|_I$ the method consists in first defining the
sum of the squared distances between the observations $z_i$ and a structure St,
fixed for the present. After that we have to search for a structure from the
model which minimizes this distance, with respect to the state variables
observed with errors. Let # = [n), 2]. Then the LSE is obtained from 


6 = arg min [len — Mn l?,_——- (45) 


[u,7]ES, 


where w = wn) is the known vector of regressors. This minimization can also 
be written as 


$= argmin min |e — pl: (46) 
’ TEM BG) ESwx 

As is well known from regression analysis, quadratic distances different from 
the Euclidean are sometimes more suitable. Let W € Itz, be a weighting 
matrix and ||-||,, the corresponding quadratic norm. This yields weighted LSE 
(WLSE) with weighting matrix W. If W~! is the true covariance of [(,), ge- 
neralized LSE (GLSE) are obtained. As will be seen, the LSE for the regression 
models are included in a natural way (cf. equations (3.1.50) —(3.1.53)). Ob- 
viously, the following statement holds. 


Theorem 3.2.1 For normally distributed errors (n) with known covariance the 
GLSE is MLE. 
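
The reasoning behind this statement can be written out in one line. The following is a sketch under the assumption that the observation vector is $z = \mu + \varepsilon$ with $\varepsilon \sim N(0, \Omega)$, $\Omega$ known and nonsingular; the symbol $N$ for the dimension of $z$ is introduced here only for illustration.

```latex
-\log p(z \mid \mu)
  = \tfrac{N}{2}\log 2\pi + \tfrac{1}{2}\log\det\Omega
    + \tfrac{1}{2}\,(z-\mu)'\,\Omega^{-1}(z-\mu)
  = \text{const} + \tfrac{1}{2}\,\|z-\mu\|^{2}_{\Omega^{-1}} .
```

Since the constant does not depend on $\mu$, maximizing the likelihood over the admitted $\mu$ is the same as minimizing the quadratic distance with weighting matrix $W = \Omega^{-1}$, which is the GLSE criterion.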


Remark 3.2.1 For LIFU*, Sprent (1966) introduced another method also 
called generalized LSE. With the residual variable é = y,, — (I, © B) He 
the following ‘least squares criterion’ can be introduced heuristically:


k,(a) = min é'(H, 86’) é. (47) 
nell 

Dolby (1972) showed equivalence of this estimate with the WLSE for known
covariance under normal error distribution. A graphical interpretation of the
WLSE is possible for the case $\Omega = I_n \otimes I_p$ (cf. Figure 3.2.1). Here the
estimated structure has to be exactly that structure from the model which minimizes
the sum of squares of the orthogonal distances from the observed points $z_i$ to it. The
WLSE with this special weighting matrix is called orthogonal LSE (ORLSE).
For bivariate LIFU* the WLSE lies between both lines of regression (cf.
Figure 3.2.2) for every nonsingular weighting matrix.


Fig. 3.2.1.


Fig. 3.2.2.


The regression lines are the WLSE for $\sigma_\delta = 0$, which leads to $\hat\beta_{y|x}$, and
for $\sigma_\varepsilon = 0$. For these lines, the squares of orthogonal distances are no longer
minimized, but rather the sum of the squares of distances in the direction of
the $\eta$- and $\xi$-axes, respectively. The orthogonal LSE lies between both lines
of regression, as can be seen from the computation formula. We have (cf.
(3.1.37)),


$\hat\beta = ((d_y - d_x) + ((d_y - d_x)^2 + 4d_{xy}^2)^{1/2}) / (2d_{xy})$,


$\hat\beta_{y|x} = d_{xy}/d_x$, $\hat\beta_{x|y} = d_y/d_{xy}$.


Using the inequality $d_{xy}^2 \le d_x d_y$ (Cauchy’s inequality) implies that the
radicand of the root term occurring in $\hat\beta$ is not greater than


$(d_y - d_x)^2 + 4d_x d_y = (d_x + d_y)^2$. (48)


Thus for $d_{xy} > 0$,


$\hat\beta_{y|x} \le \hat\beta \le \hat\beta_{x|y}$. (49)


The other cases are treated in the same manner. 
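
The bracketing of the orthogonal LSE by the two regression lines can also be checked numerically. The following sketch uses simulated data (all numbers are hypothetical), computes the three slopes from the centred sums of squares and products, and cross-checks the orthogonal slope against the principal axis of the 2 x 2 scatter matrix, which is one standard way of computing an orthogonal fit.

```python
import numpy as np

rng = np.random.default_rng(0)
xi = np.linspace(0.0, 10.0, 50)                 # true design points
x = xi + rng.normal(scale=1.0, size=xi.size)     # regressor observed with error
y = 1.0 + 2.0 * xi + rng.normal(scale=1.0, size=xi.size)

xc, yc = x - x.mean(), y - y.mean()
d_x, d_y, d_xy = xc @ xc, yc @ yc, xc @ yc

# Orthogonal LSE slope and the two ordinary regression slopes.
beta_orth = ((d_y - d_x) + np.sqrt((d_y - d_x) ** 2 + 4 * d_xy ** 2)) / (2 * d_xy)
beta_y_on_x = d_xy / d_x
beta_x_on_y = d_y / d_xy

# Cross-check: slope of the eigenvector belonging to the largest eigenvalue
# of the scatter matrix (eigh returns eigenvalues in ascending order).
S = np.array([[d_x, d_xy], [d_xy, d_y]])
_, V = np.linalg.eigh(S)
beta_principal_axis = V[1, -1] / V[0, -1]
```

Here $d_{xy} > 0$, so the expected ordering is regression of $y$ on $x$, then the orthogonal fit, then regression of $x$ on $y$.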


3.2.2.4 Measurability and uniqueness 


At this point we will deal briefly with measurability of MLE and WLSE. It
is obtained under weak assumptions in the general context of
minimum-contrast estimators (cf. Pfanzagl, 1969; Strasser, 1973; and the literature
mentioned there). It is sufficient that $l_z$ is continuous in both the arguments $\vartheta$
and $z$ and that for every $z$ the minimum is attained for a finite parameter value.
Even if the estimate $\hat\vartheta$ is not unique since several minima exist, we can choose
one of them to obtain a measurable function of $z$ (cf. Witting and Nölle, 1970,
3.32). The notation


$\hat\vartheta = \arg\min_{\vartheta} l_z(\vartheta)$


should also be understood in this sense, as a measurable choice of the
minimum. In practical cases the assumptions of the general theorems can
mostly be taken as fulfilled. For this reason we will not go into further details
with respect to the question of measurability.

The uniqueness of LSE for very general curve-fitting models has been shown
in the fundamental paper of Pázman (1984). The proof is based on arguments
from differential geometry and is therefore beyond the scope of this book. The
methods for previous less general results by Höschel (1978b) have been
extended in another direction. They are used to show the global identifiability
of the structural parameter in most practically applied models (cf. Theorems
3.1.4 and 3.1.5). For LIFU*, measurability and uniqueness of MLE follow
directly from their representation as solutions of certain eigenvector problems,
as can be seen in the following section.


3.2.3 Linear functional relations with nonrandom experimental design 
and known covariance 


3.2.3.1 The model 


In Section 3.2.1 we showed that a unified approach to bivariate LIFU is
possible. At least for normally distributed $\xi_i$ this possibility would also exist
for the multivariate case. The method described in Section 3.2.1 for the
computation of stationary points of the maximum likelihood equations can in
principle also be used for the multivariate case. However, until now
multivariate LIFU~ has not achieved the importance of the corresponding LIFU*.
This is, of course, caused by the more complicated computations necessary to
determine the MLE even in the normal case. The basic difficulties
of this model were already exhibited by the bivariate model (cf. Section
3.2.1). Moreover, in contrast to the case of univariate $\xi_i$ the assumption of a
multivariate normal distribution for the experimental design is practically less
important. As already indicated (cf. equation (3.1.50)), it is difficult to define
the distribution of $z = \mu + \varepsilon$ for nonnormal $\mu$. That is why the literature on
multivariate LIFU has nearly exclusively treated LIFU with nonrandom
experimental design. Here the MLE derived in Section 3.2.1 for LIFU* result as
special cases.

LIFU* with known covariance are treated in this section. The assumption
of a known covariance is admittedly not satisfied in most practical applications,
but the corresponding investigations provide devices for the construction
of two-step estimators. Here we will also deal with the case of a general
covariance $\mathrm{D}z = \Omega \in M^{>}_{np}$. Unlike the case of uncorrelated single
observations, $\mathrm{D}z = I_n \otimes \Sigma$, this also covers time series and other models.


3.2.3.2 Least squares estimation 
Consider the LIFU* (cf. (3.1.35)) 
ea An ee ee eka, ,, Ds =a. (50) 


With the distribution assumption $\varepsilon \sim N(0, \Omega)$, the MLE and WLSE (with
weighting matrix $\Omega^{-1}$) coincide as shown in Section 3.2.2. They are obtained
from


mink,(u) = min min k,(u) (51 
HELE yg L£E<p—q uel” ; 


with 
k(u) = lle — Auli. 
There exists an inner minimum for fixed # € C<,_, at the point 


f(L) = Q-2. AML) = Pgsng gn Q-U*2. (52) 


As the projection continuously depends on f and since Y<p-, is compact 
according to Dieudonné (1976, 16.11.9), we have the following result. 


Theorem 3.2.2 For the LIFUt, the WLSE is obtained from 


k,(w) = |lellg-1 — max ||u()ll7,, 
Lel<p—q 
(The WLSE for a special unknown error covariance with LIFU* will be given 
in Theorem 3.2.9). 


For the case $\Omega = \Lambda \otimes \Sigma$, $A = U' \otimes I_p$, a reduction to an eigenvalue
problem is possible. Then we have
Pong gn = Pyanyex-ng = Pasay: © Prang. 
But because of ||(4 @ B) z||} = tr (B’Bz(A'A) 2’) with Z = z we have 
IWa(£)|? = te (ZY2P yn gEAPZAAPP pay AUZ!). 


Now let 6 = (€,, -.., pq) be an orthonormal basis of »12F, Then the WLSE 
is obtained by the solution of the problem 


max tr (E@’2-V?2Qd-1/2) 
C'C=Iy-g 
with 
@ =Q7.y = ZA2U'(UA1U')1 UA’. 


The solution is well known (cf. Rao, 1973, I, 1. f. (iv)) in case that 7(Q) = p — q. 
Then we have 


C = c* 
where 
eC eC ire (Gs an Can Cac genio) 


and the C; are the eigenvectors of Q belonging to the eigenvalues 
AQ) S2-S4,(@); Q = S-U2Qr-1W2, 


Now let n = p. With 7(P 424) 2 p and consequently because of the absolute 
continuity of the distribution P*, the random matrix ZP 4-12,Z' has full rank 
almost surely and distinct eigenvalues (cf. [A 3.14]). This implies that 
L (Cp, =.+5 Cgi) 18 uniquely determined almost surely and in the model LU pog 
the maximum is attained almost surely for r = DRE 


Theorem 3.2.3 For LIFU* with A=U'@I,, n2=p, Q=A@Z the 
GLSE for the structural parameter £ € Le,_, 1s almost surely 


He —- DIE (Oy DOr) Cosi)» 


where Cp, ..-, Cai, are the eigenvectors belonging to the p — q greatest eigenvalues 
Of, Sah ?2 ACME ER eta gtee een caae, P? is uniquely determined almost surely. 


The value of 1, (A) is 
: | 
log 1,(2) = OP log 2 = & los det [A] det fst = AQ). 
2 Py 2 j=qt1 


(For this compare Remark 3.2.4 in Section 3.2.5.)
The fact that, for the structural bundle $\mathfrak{L}_{\le p-q}$, the WLSE $\hat{\mathcal{L}}$ has almost surely
the largest admitted dimension, leads to inconsistent estimates in models in
which the ‘true’ structural parameter indeed has a smaller dimension than
$p - q$. Furthermore, notice that $Q$ is closely connected with the regression
model 


Z=MU+E 
Hereby the BILUE of & is 


~ 


M = ZA-\U'(UA1U’)1 U, 
so that 
Q = MAM =Q,y 


and with the matrix of the regression residual Sz., it holds that 
Qz.u = S3z.y —ZA4Z’. 


Thus the results on the theory of multivariate linear regression form the basis 
for the treatment of LIFU*. (This is similar to the case of unknown covariance 
matrices.) 

For completeness let the MLE be given for the model (3.1.34). Due to results 
in Section 3.2.2, the MLE for f+ is 


go O. 
Now apply 


(LU27) 4 es Dy Supiyenl 
and 
R(C*)* = RC,) 


Since the eigenvalues are almost surely different, one obtains the following 
result. 


Theorem 3.2.4 For LIFU* 
2=(U'@i)ntt, Doe—A@z, 
iO, 1 Dt ete 
the WLSE is obtained for L+ in the form of 
f+ = 5-120,, 


where Oy = (Cy, --+) Cy) are the eigenvectors belonging to the q smallest eigenvalues 
of S-¥2QS-12, With L any other matrix, L € Mix. is WLSE of KL+) = KL). 


Nevertheless, R(L) is almost surely uniquely determined. 
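
The eigenvector construction of Theorem 3.2.4 can be sketched numerically for the simplest setting. The code below is an illustration only and rests on assumptions not taken from the text: no replications and $\Lambda = I$, $U = I$, so that $Q$ is taken to reduce to $ZZ'$; the function name `wlse_normals` and the test data are hypothetical.

```python
import numpy as np

def wlse_normals(Z, Sigma, q):
    """Z: p x n data matrix (columns are observations); Sigma: known p x p
    error covariance. Returns a p x q matrix spanning the estimated
    orthogonal complement, built as Sigma^{-1/2} times the eigenvectors
    belonging to the q smallest eigenvalues of Sigma^{-1/2} Q Sigma^{-1/2}."""
    w, V = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T   # symmetric inverse root
    Q = Z @ Z.T                                     # assumption: Lambda = I, U = I
    _, C = np.linalg.eigh(Sigma_inv_sqrt @ Q @ Sigma_inv_sqrt)  # ascending order
    return Sigma_inv_sqrt @ C[:, :q]

# Noise-free points on a two-dimensional subspace of R^3 (p = 3, q = 1).
rng = np.random.default_rng(0)
L_true = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]])
Z = L_true @ rng.normal(size=(2, 20))
Sigma = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]])
normals = wlse_normals(Z, Sigma, q=1)
```

For error-free data the returned direction is orthogonal to the true subspace, whatever positive definite $\Sigma$ is used; with noisy data $\Sigma$ determines the direction in which distances are measured.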
We have
log L(A) = ay log 2x — — 5 lee det [A] det [2] — =. s A(Q 


4=1 
This also holds for the model 
Eb” € Oregtpxa: 


Notice that + = G, if 6, consists of the eigenvectors of (Q — 1,2) q% = 0 
belonging to the smallest eigenvalues. For general covariances a computation 
is not so easy but is possible in principle due to Theorem 3.2.3. MLE then have 
to be computed with the algorithms for general models (cf. Section 3.2.6 and 
3.2.7). The following theorem for the LIFU* with £ € 2=,-, helps to shorten 
the computation since the dimensions less than » — q need not be considered. 
The theorem is formulated for WLSE. 


Theorem 3.2.5 Let P* be an absolutely continuous distribution, n = p(p — q). 
Then the WLSE ? of £ € Say-q is almost surely contained in B_ pq 
Proof. Assume a WLSE # € &,,7r < p — q would exist. Then there exist some 
subspaces /,, £, € 2,., with ¥ —F,,£ —f,, F, + F£., such that 


he (f; n £,)". 


Since f should be the WLSE for the whole model Y,-, it follows, taking 
into consideration # < f;; 7 = 1, 2, that 


ued; uel” 
Because of that, and since 2.,_ -9 = Lup» both #, and #, would be WLSE in 
Q_,_¢- Hence, because of w € (£, 9 £2), the WLSE A(z) for w would not be 


identifying for the structural parameter f. 
This can be the case only on a null set of $z$ (cf. Höschel, 1978a, theorem 6.2).


Theorem 3.2.6 Let any distribution from P* be absolutely continuous. Then 
the WLSE for # € S<y-, are obtained as Lf = KL+)+ where 


palettes)” fe sim fal ech=) 


He xa 
holds and $\hat\mu(\mathcal{L})$ is defined according to (52) (cf. Nussbaum, 1976, theorem 4.3.1;
Höschel, 1978a).


The use of this corollary consists in the fact that the minimum problem does
not have to be treated for all matrices $L^{\perp}$ from the sets $M_{p \times r}$, $r \ge q$, but only
for those with $r = q$. Thus it is possible in principle to compute the WLSE, but
in the case $\Omega \ne \Lambda \otimes \Sigma$ the computation of the WLSE is difficult.
3.2.3.3 Equivariance


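The equivariance of WLSE can also be proved for general covariance $\Omega$.
First define equivariance for the case $A = U' \otimes I_p$. Let $\mathcal{G} \subset M_{p \times p}$ be a group
of regular linear transformations on $\mathbb{R}^p$. Under $G \in \mathcal{G}$ the LIFU* model
transforms from (cf. (3.1.38))


$Z = MU + E$, $\mathrm{D}E = \Omega$
to
$GZ = GMU + GE$
or
$\tilde{Z} = \tilde{M}U + \tilde{E}$, $\mathrm{D}\tilde{E} = G_n \Omega G_n'$, $G_n = I_n \otimes G$.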


Now it is sensible to demand that the estimate correspondingly follows that 
transformation G over the structural space. Above all, that is the case for 
models which are formulated coordinate-free with geometrical invariant 
terms. In that case equivariance would be defined by 


where the index 2 expresses the dependence of the estimate on the model 
parameter 2. This definition is especially obvious for the case A = U’ & I. 
For general A € Mnpx np» m = n, the following commutation property has to be 
assumed : 

A) GP, OG) A, VGES. (53) 


Then the following theorem holds. 


Theorem 3.2.7 The WLSE are equivariant for every group & of regular linear 
transformations with the commutation property (53). 
Proof. Due to Theorem 3.2.3 the WLSE of the transformed LIFU* are solu- 


tions of 
| P5-r0 gen O22I[” = Ppa 4 gnQ¥!22|2, VE € aa (54) 
Now it holds that 
B71 = (G,,26,)-) = W014), 
(55) 
7 amet OPA a CoN oe Gr. 


Then, with the commutability (53) the well-known representation of projectors 
P, = L(L'L)" L, R(L) = £ for all # € &,-, implies 


| Bote ge 2 tz = |Po- agri enQ-122||? (56) 
or, similarly, because of G)1£" = (G-1L)",


[Prnalead Met PParn ash 7 


for all £ — R? with # = G-1Q,_, for at least one £ € &, 4. But, the last set is 
Q,-q itself. This implies 


Po = GIL 5. Pa 


3.2.3.4 Linear functional relations with nonrandom nonobservable variables 
and linear regression part 


Concluding this section we still remark on the computation of WLSE in general 
LIFU* with linear regression part (model (3.1.38); consider the identifiability 
condition of Theorem 3.1.3). Then we have 


k(u, w) = min min |e —(U’ @ Ip) wu —(V' @ Ip) wl (58) 


eMeQe_ | uci, 


For 2 = A ® & the inner minimization problem is an ordinary multivariate 
linear regression. It follows that 


A! = P prnyry yn QZ, (59) 
consequently 

jue) = M,(z, ph, Q) = D-U2Z A-1V"(VA-1V")-2 
with 

DTM ee 


and thus, in a compact way of writing (cf. 3.1.3), 


k,(a) = min ||2 — (U' @ Ip) wo || (60) 
‘ wegr 
with 
Z=Z—ZAV'(VAY')-1 =: Z/V 
and 


U = U — UAV'(VA-1V')! =: UV. 
With this new model we can proceed as above. (For further modifications of 
A, and A, one proceeds analogously (Hdschel, 1978 a).) 


3.2.4 Linear functional relations with nonrandom experimental design 
and covariance known up to a factor 


As explained in Section 3.2.1.3 the likelihood function can be unbounded
for LIFU* if no further restrictions are set on the parameters. In these cases
no MLE exist and MLS yield inconsistent estimators in general. In bivariate
LIFU* it is not sufficient to know one variance ratio; only with two known
variance ratios does one obtain MLE. Thereby we get LIFU* if $\sigma_\xi^2/\sigma_\delta^2 = 0$ or
$\sigma_\xi = 0$. Then the knowledge of the other ratio $\sigma_\delta^2/\sigma_\varepsilon^2$ is equivalent to the
knowledge of the covariance up to a factor. In the case of general multivariate LIFU*
this additional information is also sufficient for the boundedness of the
likelihood function, as will be shown in the following. But this does not imply the
consistency of MLE. Whereas consistency of MLE holds for the structural
parameters, the corresponding estimates of the covariance factor are
inconsistent. For bivariate LIFU* this is to be found in Kendall and Stuart (1961,
29.19), and for multivariate LIFU* in Gleser and Watson (1972).

Now we show boundedness of the likelihood function and then we will give 
the MLE. We consider LIFU* with linear regression part (cf. (3.1.38)). Then, 
for DOE = o-I,o€ R> we have 


I.(u,0) & o®? exp (—lle — Apl|2-s/2o} 
with 
Kan) = [M4(n,)1 Mn,)2] € aK IRES 


ee Ae (AG | Ae Wenn 68 


According to Theorem 3.1.3, £ is identifiable if r14] = np. To prove bounded- 
ness of 1, it suffices to show that the vector norm occurring in1, has a positive 
lower bound. 


1. case: r[A] < mp 


This case describes replicated measurements for A = U’ ® I,, U’ = Diag(( Taine 
If $r[A] < mp$ then at least one $m_i > 1$.
Thus &(A) is a proper subspace in IR?”. But then we have 


je — Aliza > lle — PP2/3-+ = 0. 


2. case: [A] = mp 


Because of 7[U] = n = m, the matrix U is regular and thus z = Ay is equi- 
valent to 2 = A-lz = wp. 
But 

= [Meni Menges Mey © £5, LE Sapa? 


According to [A 3.14], for n, = p — q, the first subvectors of 2 are almost 
surely not contained in any subspace £ € <p, since with P* < /?™ we 
also have P* < 4?™. Consequently 2 is almost surely different from yu and 
according to [A 3.14], for w € £" X IR", £ € Lapa 


max 1,(u, o) = max max 1,(u, 0), 
L,o u o 
where the maximum is attained for


A 


1 
6(u) = a lle — Apllg-. 


The following maximization of 


1 


can be carried out as described in Section 3.2.3. As special cases one obtains 
bivariate LIFU* and the cases 


A= (U's LAO uae Gs (cf. Nussbaum, 1976, 5.) 
with DO = of, 6.2 (cf. Casson, 1974) 


A=I1,,®Ip, M2 =9 (cf. Gleser and Watson, 1972). 


3.2.5 Linear functional relations with nonrandom experimental design 
under independent normally distributed errors 


Once again we will immediately treat the general LIFU* with linear regression 
part. The representation is based on Anderson (1951a, section 2). The model 
is given by (cf. (3.1.38)) 


Z=M,U+M.V+S, L''M,=0, 
(61) 
Q€T = (In @ Z| TEMP}. 


WO was treated in Section 3.1.4, Theorem 3.1.3. According to this 
we have r((U V]))= 2. 
We put M = (M, 1 M.), W = [U; VIE Maxims % = Ny + Ne (cf. Section 3.1.3). 


As distribution model we assume 

5 = Sim OM = {N(0, 2) | Qe T}. (62) 
For Q = A @® &, by transforming the model with A-1/2, ie. 

Z := ZA-'2, 0 := UA?, 
one can obtain that the transformed error covariance is 


Q=1, @ =. : (63) 
For unique parametrization of the space $R(L) = \mathcal{L}$, only such matrices $L^{\perp}$
are considered for which


$L^{\perp\prime}\Sigma L^{\perp} = I_q$. (64)


With $L^{\perp}$ also $L^{\perp}G$ is in the admitted parameter region for all orthogonal
matrices $G \in M_{q \times q}$.

To obtain the MLE all stationary points of the likelihood function are to
be computed. The MLE is among them if the likelihood function is bounded.
This is shown as in regression theory.


Theorem 3.2.8 If P* is an absolutely continuous distribution, then
$k_z(M, \Sigma) = -m \log\det[\Sigma] - \operatorname{tr}\Sigma^{-1}EE'$ (65)


is almost surely bounded for $m - n \ge p$, $M \in M_{p \times n}$, $\Sigma \in M^{>}_{p}$.


Proof. Because of [A] = 7[W’ ® I,] =n < m, &(A) is a proper subspace in 
IR", and because of Q-/24 = W’ @ 2-1? it results that 


|2-M22 — Q-VAul = (LZ — Pony) Q-1?2|/? 
= |(Py @ 1) (LZ @ 2-1). (66) 
Vor 2 := (Py 2-12) z, ee 2-123 Py1, holds and thus 


max k(M, Z) < —m log det [2] — tr 2-3/2 Pyiz’ - 
M : 

Because of 7(Py-1) = m — n ,it follows, according to [A.3.14], that S := ZPy4 

xZ' €M-> almost surely. Consequently, according to [A 3.15] we have 


max max k(M, 2) < —m ices det [S] + «< ow. Hi 
m 


= M 


The boundedness is not valid for 7(W) = m = n. This has been shown by 
Anderson and Rubin (1956) and Solari (1969) in their results on the unboun- 
dedness of the likelihood function for bivariate LIFU*. Examinations on sol- 
vability of ML-equations for the multivariate LIFU* with linear regression 
part are to be found in Florens et al. (1976) for the case m = n, and in Willassen 
(1979). For m > n the MLE can be obtained from an eigenvalue problem similar 
to that in Section 3.2.3. 
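
The role of the condition on $m - n$ can be illustrated numerically. In the sketch below (hypothetical dimensions and simulated data; `residual_scatter` is an illustrative name), the residual matrix $S = Z P_{W^\perp} Z'$ is positive definite when $m - n \ge p$, so that $\log\det[S]$ is finite, whereas for $m = n$ with regular $W$ the projector vanishes and $S = 0$.

```python
import numpy as np

def residual_scatter(Z, W):
    """S = Z (I - W'(WW')^{-1} W) Z' for Z (p x m) and W (n x m, full row rank)."""
    m = Z.shape[1]
    P_perp = np.eye(m) - W.T @ np.linalg.solve(W @ W.T, W)
    return Z @ P_perp @ Z.T

rng = np.random.default_rng(1)
p, n = 2, 3

# Enough observations: m - n = 3 >= p = 2.
m_ok = 6
S_ok = residual_scatter(rng.normal(size=(p, m_ok)), rng.normal(size=(n, m_ok)))
min_eig_ok = np.linalg.eigvalsh(S_ok).min()

# Degenerate case m = n: W is square and regular, the residual projector is zero.
S_bad = residual_scatter(rng.normal(size=(p, n)), rng.normal(size=(n, n)))
```

In the degenerate case the data can be fitted exactly, which is the mechanism behind the unbounded likelihood mentioned above.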


Theorem 3.2.9 (Anderson, 1951a) The MLE for LIFU* with linear regression 
part result from 


She = Cy = (C4, ..., Cy) € Moxgs : (67) 
where the $c_i$ are eigenvectors from
(Q =m AS) CE = 0, 
S cam Sz.w> 


Q=0,¢, T= UI —V'(VV')-1 V) = UP,,- 
witht 


D= Diag (A, ae) Ap)» Ay S200 Ape C= (C;, C*) € Mx: 


The eigenvectors are normed according to 


C'SC =(I,+ D)y*, 8 =Sjm. 


(68) 


(69) 


(70) 


(71) 


The p —q +1 greatest eigenvalues i; are almost surely different. With L+ any 


orthogonal transformation is also MLE of Lt, if it satisfies (64). 
With 
D, = Diag (A,; <s.544) 


3=8 + SL(1, + D,) DL'S 
M, = (1, —St+L') M,, 
M, = 8,757 
holds and we have 
M, = (Szv — M,8py) Sy’. 


The maximum value of the logarithmic likelihood function is


q 
(M1, S) = —“P + AP hog 2n— = log det [8] — = ¥1 
(M, 8) = —"P +P log 2n— F log det [8] — “FY log (1 
Proof. For the logarithmic likelihood function it holds that 
ae 1 
LAM, 2) = log 2% — = log det [2] — BR tr 5-10’. 


In connection with the restrictions one obtains the Lagrange function


i=14 tr(AM{L*) + - tr(B(LY'EL+ — 1)) 


(72) 
(73) 
(74) 
(75) 


(76) 


+ 4i). 
(77) 


(78) 


with Lagrange multipliers A € 7.,,, B¢ Mi. With the partial derivative 


with respect to Lt, 
AM), + BL'S=0 


(79) 
follows for the stationary points of J, and because of (63) and (64) we have
AM{L* + BI SL = B=0. (80) 


Now we assume Syy = 0. Otherwise, instead of the initial variables we apply 
transformed ones: 


G=UI—P,), MM, = M,+ M,S8yyS>". (81) 


Then we would havef = E = Z — M,0 — M,V. Because of (80) and Syy = 0, 
the partial derivatives of lz with respect to Y, M,, M,, and L+ lead to 


m& —~to' =0 (82) 
WS eM oa ed = 0 (83) 
Z-1Szy — T-1MSy = 0 (84) 
AM, =0. (85) 


‘The solution of (84) provides M,. From (83) one obtains M,, since the multi- 
plication with 2’ and 2" provides, with (64), 


A = LI"'S8zy. (86) 
With that, from (83) one obtains 
M, = (I — ELLY) 8qySG, (87) 


and after replacing variables according to (81) this implies equation (74). 
Because of (86), (87), and from the definition of Q one obtains, with (85), 


(a LL) OL = 0, (88) 
Then (88) implies 
mS = 8 + SELYQI4+L"S. (89) 


Within the set of matrices admitted for Z+ we choose — if necessary after 
orthogonal transformation — those ones with 


PAP uOnue fs Dine tia, 1): (90) 


m 
Then (89) implies 

mS = 8 + mSL:DL"'2. (91) 
The multiplication with Z+ provides 

mZL1(I — D) = SL". (92) 
With that (88) implies

QL — mZL!:D = 0 (93) 
or 

QL!(I — D) — mZL+(I — D) D = 0. (94) 
Hence (92) implies 

QL! = (Q+ 8) LD. (95) 


Consequently Z+ consists of q of the eigenvectors ¢,,...,¢, of the eigenvalue 
problem 


\Q —4(Q + 8)| = 0. (96) 


These eigenvalues are less than 1, since $Q + S \ge Q$. The relation between
the eigenvalues of (68) and (96) is


$D = D_1(I_q + D_1)^{-1}$. (97)
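
The relation between the two eigenvalue problems can be verified numerically. The sketch below uses randomly generated matrices (hypothetical data): if $\lambda$ solves $(Q - \lambda S)c = 0$, then $\lambda/(1 + \lambda)$ solves $|Q - \tilde\lambda(Q + S)| = 0$, and the latter values are therefore less than 1.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = rng.normal(size=(p, 6))
B = rng.normal(size=(p, 8))
Q = A @ A.T          # positive semidefinite
S = B @ B.T          # positive definite (almost surely)

# Eigenvalues of (Q - lam * S) c = 0, i.e. of S^{-1} Q.
lam = np.sort(np.linalg.eigvals(np.linalg.solve(S, Q)).real)

# Eigenvalues of |Q - lam_tilde * (Q + S)| = 0, i.e. of (Q + S)^{-1} Q.
lam_tilde = np.sort(np.linalg.eigvals(np.linalg.solve(Q + S, Q)).real)
```

Because the map $\lambda \mapsto \lambda/(1+\lambda)$ is increasing, the ordering of the eigenvalues is the same in both problems, so the smallest eigenvalues correspond to each other.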


According to [A 3.14] the eigenvalues of (68) are almost surely different if 
m > n. Thus, for arbitrary eigenvectors O = (é,,..., é,) of (68) it holds that 


O'S = Diag (hy, ..., ky) (98) 


with certain constants k; > 0. We choose k; = m, j = 1,..., p. Then, for a 
certain matrix K we have: 


~ 


I'=6,K, K=Diag(k,,...,4,), 6, = (&,.--,&)- (99) 
Because of (92) and (64) the relation between K and D implies 

mI, — D) = L+'SL+ = KC’. 8C,,K = mR? . (100) 
that is, 

RK = ji, —B. (101) 


Now we show that the q smallest eigenvalues of S~!@ have to occur in Dy. 
Due to (91), (92), we have 


mz = S+ =A SLA. — D)-1 DiI -- D)-+ L's (102) 
m 
1s a ae 
=S+ am SC,(I — D) DOS. (103) 


From (98) and (101) we obtain 


OSE =1 +4 [1,10] DU — Dy (1,1 0). (104) 


1 
Because of (98) it further follows
a ie e 
det [2] = det [C]-? J] 4,(1 + 4,). (105) 
j=1 
With that, according to (97), (98), we have, as likelihood function, 


1,(M, 5) = — “Flog 2n— ~ log det [8] — ~ Silog tian ve 
k=1 


(106) 


For every choice of the $\lambda_{j_k}$, $k = 1, \ldots, q$, this provides the logarithms of the
likelihoods at the stationary points. Since $l_z$ is bounded the maximum
is attained precisely for the $q$ smallest eigenvalues $\lambda_1, \ldots, \lambda_q$. Finally, the obtained
estimates are transformed according to (81).


Remark 3.2.2 The covariance estimate $\hat\Sigma$ is not always consistent (Villegas,
1961, for $q = 1$).


Remark 3.2.3 Theorems 3.2.4 and 3.2.9 obviously have something in common
that goes deeper. In fact, for a general linear time series model
it was shown by Robinson (1974) that with the estimates constructed in such
a way from the above eigenvalue problem, every eigenvalue of the residual
matrix $\hat\varepsilon\hat\varepsilon'$ is minimized. In a unified way this implies very elegantly the
Theorems 3.2.4 and 3.2.9, though with less elementary devices.

In Theorem 3.2.9 the matrix $Q$ can be given in a still simpler form,
providing, in particular, a device to explain the relations between
econometric estimation procedures and the MLE. If $S_W$, $W = [U\ V]$, is nonsingular,
it holds that (Laha, 1957):


Qz..viv) — 92.0 = 82.0.0) 7= Szu.vScvSuzv- (107) 


This reflects the relation between the projectors $P_U$, $P_V$, $P_W$. For singular
models (that is, if $S_U$ or $S_V$ or $S_W$ are singular) this decomposition formula
holds only under additional assumptions. For this $R(S_U) = R(S_{U.V})$ and $R(S_V)
= R(S_{V.U})$ are sufficient (Höschel, 1976). That can be interpreted in terms of
the exogenous variables $U$ and $V$. $R(S_U) = R(S_{U.V})$ means, for instance,
that the largest correlation between $U$ and $V$ has to be less than one (Höschel,
1974, 5.1).


Example 3.2.1 For LIFU* with replicated observations we have 


PEED Vie 


U = Diag (1;,,); t= 1,...,n;m = > m;. 


t=1 
Then U = 0 and Py = Py») = Py: holds. This implies


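$S = S_{Z.W} = S_{Z.U} = \sum_{i=1}^{n} \sum_{j=1}^{m_i} (z_{ij} - \bar{z}_{i.})(z_{ij} - \bar{z}_{i.})' = W_Z$, (108)


$Q = Q_{Z.U} = \sum_{i=1}^{n} m_i (\bar{z}_{i.} - \bar{z}_{..})(\bar{z}_{i.} - \bar{z}_{..})' = B_Z$.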


Let $\hat{C} = (\hat{c}_1, \ldots, \hat{c}_p)$ be those eigenvectors of


$(B_Z - \lambda_i W_Z)\hat{c}_i = 0$, $0 \le \lambda_1 \le \cdots \le \lambda_p$, (109)
with 

Ow, C=ml,, 6 =(6,,6*), (110) 
and let 

Reo (O7 Dye 0* (111) 


Then we have 


and 
i+ = 6,. (112) 
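
The computation in this example can be sketched as follows. The code is an illustration with simulated replicated observations around a hypothetical line $\eta = 1 + 2\xi$ (so $p = 2$, $q = 1$); it forms the within-groups and between-groups matrices and takes the eigenvector of $(B_Z - \lambda W_Z)c = 0$ belonging to the smallest eigenvalue as the estimated normal direction, up to the normalization used in the text.

```python
import numpy as np

def within_between(groups):
    """groups: list of (p x m_i) arrays, one per design point.
    Returns the within-groups matrix W_Z and the between-groups matrix B_Z."""
    all_z = np.hstack(groups)
    grand_mean = all_z.mean(axis=1, keepdims=True)
    p = all_z.shape[0]
    Wz, Bz = np.zeros((p, p)), np.zeros((p, p))
    for g in groups:
        group_mean = g.mean(axis=1, keepdims=True)
        R = g - group_mean
        Wz += R @ R.T
        d = group_mean - grand_mean
        Bz += g.shape[1] * (d @ d.T)
    return Wz, Bz

rng = np.random.default_rng(3)
groups = []
for t in np.linspace(0.0, 5.0, 6):                 # six design points
    true_point = np.array([[t], [1.0 + 2.0 * t]])
    groups.append(true_point + 0.05 * rng.normal(size=(2, 4)))  # four replicates

Wz, Bz = within_between(groups)
evals, evecs = np.linalg.eig(np.linalg.solve(Wz, Bz))
normal = evecs[:, np.argmin(evals.real)].real
normal = normal / np.linalg.norm(normal)           # unit length for comparison only
```

With small error variance the estimated normal is close to the direction $(-2, 1)$, which is orthogonal to the assumed line.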


Remark 3.2.4 Starting from Definition 3.1.5, the question arises whether 
the MLE of B can be obtained also for explicit LIFU* instead of the implicit 
LIFU* considered up to now. That can be expected since for bivariate LIFUt 
with the parametrization « = [1, 6] &, only the 7-axis is not directly comprised, 
though it is obtained as special case for B = oo. In the general case eventually 
after permutation of the components of the z; and w;, respectively, let the first 
p — q components of ; be the independent variables &;, i.e. w; = [Ip-q B} &;. 
Consequently if the distribution of the smallest ¢ eigenvectors C, is not ‘patho- 
logical’ the (¢ X p)-matrix will be regular. By the symmetry of the problem 
this, of course, also holds for any other selection of g rows from C,. Now, to 
any QY = Q(2n)) and S = S(zn)) (cf. (68), (69)) there belongs exactly one suitably 
normed (p X p)-matrix C of eigenvectors. 

Under the decomposition $C = (C_1, C^*)$, $C_1 = (C_X', C_Y')'$, $C_Y \in M_{q \times q}$, we
denote the set of those $z_{(n)}$ for which the matrix $C_Y$ is not invertible by $\mathcal{Z}$.
With that we have $P(\mathcal{Z}) = P(\{C \mid \det[C_Y] = 0\})$. Now the determinants
are analytic functions of the elements of the matrix $C$. Furthermore, take into
consideration that the orthogonal $(p \times p)$-matrices form an analytic
manifold (cf. James, 1954, p. 43). Thus the elements of $C$ are analytic functions
themselves within the local coordinates of this manifold. But the set of zeros
of an analytic function is a null set (cf. Fisher, 1966, theorem 5.A.2). According
to Girko (1975, corollary 4.3.1 (2)), it finally results that $P(\det[C_Y] = 0)$
can be represented as an integral of a certain density function with respect to
the Haar measure over the just described null set of orthogonal matrices. With
that, $\mathcal{Z}$ itself is a null set with respect to the distribution of $z_{(n)}$.
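Example 3.2.2 Now we consider explicit LIFU*, $\mathcal{L} = R\left(\begin{pmatrix} I_{p-q} \\ B \end{pmatrix}\right)$, $B \in M_{q \times (p-q)}$.
Then we have


$R(C_1) = R(L^{\perp}) = R\left(\begin{pmatrix} -B' \\ I_q \end{pmatrix}\right)$. (113)
Hence, for a regular transformation $F \in M_{q \times q}$,

$C_1 F^{-1} = \begin{pmatrix} -B' \\ I_q \end{pmatrix}$. (114)
Therefore, with the partition

$C_1 = \begin{pmatrix} C_X \\ C_Y \end{pmatrix}$, $C_Y \in M_{q \times q}$, (115)
one obtains the equation


$\hat{B}' = -C_X C_Y^{-1}$. (116)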


According to Remark 3.2.4, $C_{12}$ is almost surely regular. With the natural
decomposition of $S_{Z.W}$ into $S_{X.W}$, $S_{XY.W}$, etc., it easily follows for the column
vectors of $C_{11}$ and $C_{12}$ that


¢; = —(Qyv.u.v — ASy.w) 4 (Qrx.u.v — ASrx.w) & (117) 
and 
¢: = (—Qy.u.v — AiSx.w) 1 (Qxv.u.v — ASxv.w) Ci- (118) 


With that, for g = 1, (118) implies a known formula for the LIML in linear 
simultaneous equations which have only one structural equation: 


6, = —(Qx.v.v — ASx.w) + (Qxv.u.v — ASxy.w). (119) 
In general, no such simple explicit formula holds, only 


Bee {((Qx.u.v —A Syx.w)* (Qxy.uv — 4Sxyv.w) G;) dnt C,*: (120) 
The generalization for v > 1 is to be found in (3.3.29), (3.3.30). 


3.2.6 Nonlinear models with known error covariance 


Though in practical applications one can seldom work with known error co- 
variance, such models are of fundamental importance also in cases with unknown 
covariance. Often one will apply two-stage estimators, that is use a suitable 
estimate for the unknown covariance and then proceed as with known co- 
variance. The basis is a model with nonrandom experimental design 


$$ 0 = s_i(\mu_i, \kappa), \quad 0 = p(\kappa), \qquad z_i = \mu_i + \varepsilon_i, \quad i = 1, \dots, n, \tag{121} $$

with the distribution assumption

$$ \varepsilon_{(n)} \in \mathcal{P}_\Omega = \{P_\varepsilon \mid \mathrm{E}\varepsilon = 0,\ \mathrm{D}\varepsilon = \Omega\} \tag{122} $$

for a fixed and known covariance matrix $\Omega$. According to Section 3.2.2,
with a given estimation functional $l_z(\cdot)$ one obtains estimates $\hat\mu$, $\hat\kappa$ from

$$ [\hat\mu, \hat\kappa] = \arg\min_{0 = s_i(\mu_i, \kappa),\ 0 = p(\kappa)} l_z(\mu, \kappa). \tag{123} $$

Such extremal problems can only be solved iteratively, on sufficiently large
computers. Modern derivative-free descent methods secure convergence of
the corresponding algorithms (cf. Schwetlick, 1979). For quadratic objective
functionals, that is, in particular, for the WLSE, one even has global
convergence to a local minimum. But, in comparison with general nonlinear
minimization problems, the special form of the present model implies a special
structure for the known iteration methods. These will be described in Section
3.8. For every iteration procedure we always need a suitable initial
approximation for $\mu$ and $\kappa$. The choice is difficult. The most reasonable and hardly
replaceable initial approximation for $\mu_{(n)}$ is $z_{(n)}$ or, for replicated observations,
$\bar z_{(n)}$. Such a ‘natural’ initial approximation does not exist for $\kappa$. But this
problem is not to be treated further here. Egerton and Laycock (1979) showed that
even with that initial approximation one of the best-known iteration procedures may
fail to converge to the global minimum. To be sure, in practical applications it
is already valuable to obtain an improvement of the curve fit in comparison
with the initial estimate, as is the case when a local minimum is reached. But
if such a local improvement does not suffice, the global minimum has to be
determined.

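For this the method of Lagrange multipliers is suitable. Provided that it exists
and the functions are sufficiently smooth, the global minimum is among the
stationary points of the Lagrange function with arguments
$\vartheta = (\mu_{(n)}, \kappa, \lambda, \chi)$:

$$ \operatorname{lag}(\vartheta) = l_z(\mu_{(n)}, \kappa) + \lambda' s_{(n)}(\mu_{(n)}, \kappa) + \chi' p(\kappa), \tag{124} $$

where $\lambda$, $\chi$ are vectors of Lagrange multipliers. The stationary points result
as solutions of the following system of equations:

$$ 0 = \partial_\kappa l_z + \lambda' \partial_\kappa s_{(n)} + \chi' \partial_\kappa p, \quad 0 = \partial_\mu l_z + \lambda' \partial_\mu s_{(n)}, \quad 0 = s_{(n)}(\mu, \kappa), \quad 0 = p(\kappa). \tag{125} $$

For explicit models the stationary points result from the equations

$$ 0 = \partial_\xi l_z(\mu_{(n)}(\xi_{(n)}, \kappa)), \quad 0 = \partial_\kappa l_z(\mu_{(n)}(\xi_{(n)}, \kappa)). \tag{126} $$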
In general, such equations are as difficult to handle as are minimization
problems. They also have to be solved iteratively. For polynomial models, which
are of great practical importance, one can fall back on more precise procedures
to determine the roots. That is why it is of interest to describe these equations
more precisely for important special cases and to show simplifications.
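
The role of the stationary points can be illustrated on a deliberately small polynomial model. The following sketch is not taken from the text: it assumes a single observation $(x_0, y_0)$, the hypothetical curve $y = \kappa \xi^2$ with $\kappa$ held fixed, and an unweighted squared distance as objective. The stationarity condition is then a cubic polynomial in $\xi$, all of whose real roots can be located and compared.

```python
def objective(xi, x0, y0, kappa):
    """Squared distance from (x0, y0) to the curve point (xi, kappa*xi**2)."""
    return (x0 - xi) ** 2 + (y0 - kappa * xi ** 2) ** 2


def stationarity(xi, x0, y0, kappa):
    """Half of d(objective)/d(xi); a cubic polynomial in xi."""
    return 2 * kappa ** 2 * xi ** 3 + (1 - 2 * kappa * y0) * xi - x0


def real_roots(f, lo, hi, steps=4001, tol=1e-12):
    """Real roots of f in [lo, hi]: grid scan for sign changes, then bisection."""
    h = (hi - lo) / (steps - 1)
    grid = [lo + i * h for i in range(steps)]
    roots = []
    for a, b in zip(grid, grid[1:]):
        fa, fb = f(a), f(b)
        if fa == 0.0:
            roots.append(a)          # grid point is itself a root
        elif fa * fb < 0:
            while b - a > tol:       # bisection keeps the sign change in [a, b]
                mid = 0.5 * (a + b)
                if fa * f(mid) <= 0:
                    b = mid
                else:
                    a, fa = mid, f(mid)
            roots.append(0.5 * (a + b))
    if f(grid[-1]) == 0.0:
        roots.append(grid[-1])
    return roots


def stationary_points(x0, y0, kappa, lo=-4.0, hi=4.0):
    """All stationary points in [lo, hi], each paired with its objective value."""
    roots = real_roots(lambda xi: stationarity(xi, x0, y0, kappa), lo, hi)
    return [(xi, objective(xi, x0, y0, kappa)) for xi in roots]


# Toy data (assumed): the point (0, 2) and the parabola y = xi**2.
points = stationary_points(x0=0.0, y0=2.0, kappa=1.0)
best_xi, best_value = min(points, key=lambda pair: pair[1])
```

In this toy case there are three stationary points; the middle one at $\xi = 0$ is not a minimum, and the global minimum is attained at the two symmetric outer roots. A descent method started at $\xi = 0$ would stay there, which is the kind of difficulty the comparison of all stationary points avoids.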

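Starting from the general equations we want to give their form for the
computation of the WLSE. In this case

$$ l_z(\mu) = \|z - \mu\|^2_{\Omega^{-1}}. $$

With that, for implicit models the following system of equations is obtained
(cf. Britt and Luecke, 1973):

$$ z = \mu + \Omega\,\partial_\mu s_{(n)}'\lambda, \quad 0 = \partial_\kappa s_{(n)}'\lambda + \partial_\kappa p'\chi, \quad 0 = s_{(n)}(\mu, \kappa), \quad 0 = p(\kappa). \tag{127} $$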
For explicit models with » = 0 a more simple form follows: 
=a, OU iG, ot) = O'0, Oar i(E is) 110. on (128) 
0 = 2z,,U(u(é, x)) = C’Q-1 Diag oa: 8;,ri(€i, 2)])- 


In a slightly differing form these equations are also to be found in Dolby (1972)
for $d_i = 2$.

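Concluding this problem, we deal with the case of independent replications
of observations of a fixed experimental design, since it is of outstanding
practical importance. Let

$$ z_{ij} = \mu_i + \varepsilon_{ij}, \quad i = 1, \dots, n, \quad j = 1, \dots, m_i, \qquad \mathrm{D}\varepsilon_{ij} = \Sigma_i. \tag{129} $$

Then (cf. Rao, 1973, 8a, 5.4) we have

$$ l_z(\mu) = \sum_{i=1}^{n} \sum_{j=1}^{m_i} \|z_{ij} - \mu_i\|^2_{\Sigma_i^{-1}} = \|\bar z_{(n)\cdot} - \mu\|^2_{\Omega^{-1}} + \sum_{i=1}^{n} \operatorname{tr}(\Sigma_i^{-1} S_i) \tag{130} $$

with

$$ \Omega = \operatorname{Diag}(\Sigma_i / m_i), \qquad S_i = \sum_{j=1}^{m_i} (z_{ij} - \bar z_{i\cdot})(z_{ij} - \bar z_{i\cdot})' \tag{131} $$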
and with the natural partition of $\Sigma_i^{-1}$ into $\Sigma_i^{\delta\delta}$, etc., and with $\delta_{i\cdot} = (\bar x_{i\cdot} - \xi_i)$,
$\varepsilon_{i\cdot} = \bar y_{i\cdot} - r_i(\xi_i, \kappa)$, one obtains


Ym, O47 (2;, + L8e;,.) = 0, 
= (132) 


(mi(2ers( EH, + 2,8) + 28s, + Z88;,)) = 0. 
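
The decomposition (130) of the replicated criterion into a part depending on the group means and a part depending only on the scatter within the groups can be checked numerically. The sketch below is an illustration under simplifying assumptions that are not part of the text: each observation is a scalar ($d_i = 1$), so that $\Sigma_i$ reduces to a variance, and the data are made up. Exact rational arithmetic is used so that the identity holds without rounding.

```python
from fractions import Fraction as F


def total_criterion(z, mu, var):
    """Sum over i, j of (z_ij - mu_i)^2 / var_i, the left-hand side of (130)."""
    return sum((zij - mu[i]) ** 2 / var[i]
               for i, zi in enumerate(z) for zij in zi)


def split_criterion(z, mu, var):
    """Between-means term and within-group term of the decomposition."""
    between, within = F(0), F(0)
    for zi, mui, vi in zip(z, mu, var):
        m = len(zi)
        mean = sum(zi) / m
        # The mean of m replications has variance var_i / m_i.
        between += (mean - mui) ** 2 / (vi / m)
        # Within-group scatter; it does not depend on mu_i.
        within += sum((zij - mean) ** 2 for zij in zi) / vi
    return between, within


# Made-up replicated data: two design points with 3 and 2 replications.
z = [[F(1), F(2), F(4)], [F(0), F(3)]]
mu = [F(2), F(1)]
var = [F(1, 2), F(3)]

total = total_criterion(z, mu, var)
between, within = split_criterion(z, mu, var)
```

Since the within-group term does not involve $\mu$, minimizing the full criterion over $\mu$ is equivalent to minimizing the between-means term alone, which is why the means with covariance $\Sigma_i/m_i$ can replace the raw replications.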


It is practically and theoretically important that not only the true experi- 
mental design is identifying, but also its estimator, at least for almost all obser- 
vations. Of course, this depends mainly on the distribution of observations, 
but also on the estimation procedure and on the structure bundle. Moreover, 
this identification property is not only important for the minimization esti- 
mator itself. 

Practical computation of minimization estimators is always performed
iteratively. But no iteration procedure can yield practically utilizable
convergence to the global minimum, that is, to the desired minimization estimator.
However, modern iteration procedures provide convergence at least towards local
minima for arbitrary initial estimates. In comparison to older, less effective
algorithms this is an improvement, because the latter converge only locally and in
general can oscillate or diverge. That is why, besides the minimization estimator
itself, all other critical points of the estimation functional should uniquely
determine a state surface as well. The state surfaces belonging to different critical
points may be different. If this is the case, then in general any critical point
approached by an iteration procedure will determine exactly one system
parameter, which we will take to be the estimator. Otherwise, if there were two or
more state manifolds containing the estimated design, difficulties would arise
in the interpretation of the curve fitting.


Definition 3.2.1 Let the estimation functional $l_z(\cdot)$ be given. For almost all
observations $z_{(n)}$ let all of the (possibly different) $l$-minimizing estimates $\hat\mu(z)$
be identifying for the system parameter. Then we say that the structure bundle
has the property of identifying extremal points w.r.t. the estimation functional $l$.
If this holds for all critical points (not just local minima) of the estimation
functional, then the structure bundle is said to be identifiable by the critical points of $l$.


Consequently, after the description of identifying experimental designs in
Section 3.1.4 it must be a further aim to give practically checkable conditions
on the system equations which secure this property. It will be seen that,
essentially, these are the same ones that are already valid for the identifying
experimental designs. For ‘genuine’ errors-in-variables models, in which at least one
of the ‘independent’ state variables is observed with errors, the following
statement is obtained: if different state manifolds have no weak contact of
infinite order and $n > 2 \dim \Pi$, then almost all observations $z_{(n)}$ provide
WLSE $\hat\mu(z)$ of the experimental design which identify the structural
parameter (cf. Höschel, 1978b, 1986).


3.2.7 Models with unknown error covariance 
under normally distributed errors 


In such models the MLE results from the minimization problem described in
(42), (43). The MLE is obtained for $z = z_{(m)} = (z_{ij})$, with $i = 1, \dots, n$,
$j = 1, \dots, m_i$, from

$$ l_z(\hat\vartheta) = \min_{\kappa} \min_{\mu} \min_{\Omega} l_z(\mu, \kappa, \Omega). \tag{133} $$

Now the minimization over $\Omega$ can be carried out for several important special
cases. As will be seen, the resulting estimates $\hat\Omega = \hat\Omega(z, \mu, \kappa)$ are determined
uniquely. In those stationary points of $\log l_z(\vartheta)$ which provide the minimum, $\Omega$
then has to have exactly this value. On the other hand, the stationary points
of $\log l_z(\vartheta)$ are obtained from the same normal equations as in Section 3.2.6,
with an additional equation for $\Omega$. But the latter can immediately
be eliminated by inserting $\hat\Omega$.


Theorem 3.2.10 The normal equations with unknown covariance are obtained
from those with known covariance by inserting the estimator $\hat\Omega$ in the latter.


Now we describe $\hat\Omega$ more precisely for important models, starting with the
practically less important one.


Model 1. Independent replications of the observation of a fixed experimental 
design, with 


Q2=1@ 2, @ = (2;)ja1... bo ieee = (Cie lee 


(134) 
CD gen Oe | heginer ew Be 
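One obtains

$$ l_z = k \log\det[\Sigma] + \operatorname{tr} \Sigma^{-1} S_\varepsilon, \qquad S_\varepsilon = \sum_{j=1}^{k} (z_j - \mu)(z_j - \mu)'. \tag{135} $$

Then we have (cf. [A 3.14])

$$ \hat\Sigma = S_\varepsilon / k, \tag{136} $$

if $S_\varepsilon$ is regular. This is almost surely the case for $k > np$. The most important
case in practice is the following.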


Model 2. Independent errors over different experimental points with different 
covariance: 


C= Ding lat @ 2) ie Te Mg ty aya ys X,))~¢ (187) 
Then we have

$$ l_z = \sum_{i=1}^{n} \left( m_i \log\det[\Sigma_i] + \operatorname{tr} \Sigma_i^{-1} S_i \right), \qquad S_i = \sum_{j=1}^{m_i} (z_{ij} - \mu_i)(z_{ij} - \mu_i)'. \tag{138} $$

It results (cf. [A 3.14]) that

$$ \hat\Sigma_i = S_i / m_i, \tag{139} $$

in case that $m_i > d_i$.
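
That the covariance part of the criterion is minimized by $S_i/m_i$ can be seen most easily in the scalar case. The following sketch assumes $d_i = 1$, so that $\Sigma_i$ is a variance $\sigma^2$ and one term of the criterion reads $m \log \sigma^2 + S/\sigma^2$; the residuals are made-up numbers, and the function names are chosen here for illustration only.

```python
import math


def criterion(sigma2, m, s):
    """One scalar term of the criterion: m * log(sigma^2) + S / sigma^2."""
    return m * math.log(sigma2) + s / sigma2


def sigma2_hat(residuals):
    """Minimizer S / m, with S the sum of squared residuals z_ij - mu_i."""
    s = sum(r * r for r in residuals)
    return s / len(residuals)


# Made-up residuals z_ij - mu_i for one design point with m = 4 replications.
residuals = [1.0, -2.0, 0.5, 1.5]
m = len(residuals)
s = sum(r * r for r in residuals)

estimate = sigma2_hat(residuals)           # S / m
at_estimate = criterion(estimate, m, s)
slightly_smaller = criterion(0.9 * estimate, m, s)
slightly_larger = criterion(1.1 * estimate, m, s)
```

Setting the derivative $m/\sigma^2 - S/\sigma^4$ to zero gives $\sigma^2 = S/m$, and the criterion is larger on both sides of this value. In the matrix case the analogous argument needs $S_i$ to be regular, which is the reason for the condition on $m_i$ and $d_i$.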


Model 3. Independent observations with equal covariance: 


= Ts, m= ym, 
(140) 
SeM eek == A 2 ae 


Then (cf. [A 3.14])

$$ \hat\Sigma = S/m, \qquad S = \sum_{i=1}^{n} \sum_{j=1}^{m_i} (z_{ij} - \mu_i)(z_{ij} - \mu_i)', \tag{141} $$

in case that $m > n > d$, so that at least one $m_i > 1$. The maximum likelihood
equations have a somewhat different form if data vectors sorted by the
variables are used instead of the object-sorted $z = (z_{ij})$
(cf. Dolby and Freeman, 1975).

Finally we remark that with missing replications of observations, similar
problems occur as with linear models. Instead of equation (3.1.64), some
inequalities occur which are functions of the solutions of the likelihood equations
(Dolby and Lipton, 1972, 1.(3)). In general these inequalities contradict the
consistency of the solutions of the likelihood equations.


3.3 Further estimation procedures 


For the present the MLE — and the WLSE connected with it — have been 
presented for the estimation of general functional relations since these esti- 
mators are especially suitable for a great number of practical applications. 
Above all this is true for problems of data-fitting in the field of natural and 
industrial sciences. There a fixed experimental design can be observed repeat- 
edly, and in a good approximation the errors of the observation can be assumed 
to be normally distributed and mostly even as independent componentwise. 
In these cases the WLSE and MLE provide consistent estimators which under 
certain regularity assumptions, because of the fixed number of parameters, 
also have the asymptotic optimality properties of the MLE (cf. Section 3.5.1). 
From the preceding sections it has become obvious that the application
of the MLE is limited. Here we are thinking not so much of the fundamental
problems of the MLE (cf. Weiss and Wolfowitz, 1974), but rather of the many problems
arising with nonreplicated measurements, with nonnormally distributed or
dependent errors, or with random experimental design. Furthermore, even for models
in which the MLE is consistent and asymptotically optimal, imperfections
of the MLE such as nonexisting moments, bias or complicated computation
give rise to the construction of further estimators.

Now the present section is to give a survey of alternatives to MLE and 
explain the motives, principal rules of computation, as well as possibilities 
and limits of these alternatives. The alternatives often result from asymptotic 
considerations; however, their introduction will not be based on those details. 
A more detailed representation of asymptotic properties is to be found in 
Section 3.5. Moreover, a very detailed representation of asymptotics for a 
special but very large class of estimations on LIFU* with independent errors 
of observation over the single experimental points is to be found in Section 3.4. 


3.3.1 Linear functional relations with independent errors 


3.3.1.1 Introduction 


For the present, MLE and WLSE provide consistent estimates if the number 
of incidental parameters remains finite, that is, in particular, for a constant 
experimental design. This also holds for more general cases if the sequence of 
experimental designs satisfies specific conditions (cf. Section 3.4). It is often 
possible to get additional information about the model by observing further 
variables. The use of this additional information also yields consistency, if one 
assumes less about the sequence of experimental designs than on the correspon- 
ding initial model for the computation of WLSE. This is especially true for 
models where no replicated observations can be assumed (cf. Section 3.2.1.3). 
Of course, we are also interested in estimates for nonnormal models with 
random experimental design. However, the MLE is hard to compute in such 
cases. For clearness the following representation will be carried through mostly 
for bivariate LIFU. Then, in most cases the generalizations to multivariate 
LIFU are obvious. 


3.3.1.2 Ordinary and orthogonal least squares estimation 


First note that the OLSE and orthogonal LSE (cf. Section 3.2.2) are widely 
applicable because of their simplicity (cf. Section 3.1.2). One has to remember, 
however, that the OLSE is inconsistent and the variance does not exist for the 
orthogonal LSE. For the present we do not assume replicated observations. 
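Then it holds that

$$ b_O = b_{OLSE} = d_{xy}/d_x, \tag{1} $$

$$ b_{OR} = b_{ORLSE} = \begin{cases} \left((d_y - d_x) + \sqrt{(d_y - d_x)^2 + 4 d_{xy}^2}\right)/2d_{xy}, & \text{for } d_{xy} \neq 0, \\ 0, & \text{for } d_{xy} = 0 \text{ and } d_x > d_y, \\ \infty, & \text{for } d_{xy} = 0 \text{ and } d_x < d_y \end{cases} \tag{2} $$

(cf. Madansky, 1959).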
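
The two slope formulas (1) and (2) can be sketched as follows. The code assumes that $d_x$, $d_y$ and $d_{xy}$ denote the centred sums of squares and products of the observed $x_i$ and $y_i$; this reading is an assumption, since their definition lies outside the present section. The data are made up, and the case $d_{xy} = 0$, $d_x = d_y$, which (2) leaves open, is rejected explicitly.

```python
import math


def second_moments(x, y):
    """Centred sums of squares and products (assumed meaning of d_x, d_y, d_xy)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    dx = sum((a - xb) ** 2 for a in x)
    dy = sum((b - yb) ** 2 for b in y)
    dxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    return dx, dy, dxy


def olse_slope(dx, dxy):
    """Ordinary least squares slope, formula (1)."""
    return dxy / dx


def orlse_slope(dx, dy, dxy):
    """Orthogonal least squares slope, formula (2)."""
    if dxy == 0:
        if dx == dy:
            raise ValueError("direction undetermined: d_xy = 0 and d_x = d_y")
        return 0.0 if dx > dy else math.inf
    return ((dy - dx) + math.sqrt((dy - dx) ** 2 + 4 * dxy ** 2)) / (2 * dxy)


# Made-up data in which x and y have the same spread.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 0.0, 3.0, 2.0]
dx, dy, dxy = second_moments(x, y)

b_o = olse_slope(dx, dxy)
b_or = orlse_slope(dx, dy, dxy)
horizontal = orlse_slope(4.0, 1.0, 0.0)
vertical = orlse_slope(1.0, 4.0, 0.0)
```

For these numbers the OLSE gives 0.6 and the orthogonal LSE gives 1, which illustrates that the OLSE slope is pulled towards zero relative to the orthogonal fit when both coordinates scatter equally.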
For observations distributed according to $z_i \sim N([\xi_i, \alpha + \beta\xi_i]', \Sigma)$, the
likelihood function is unbounded (cf. Anderson and Rubin, 1956, p. 130). Although
there is no MLE as an alternative, these estimates can be compared. Using the
MSE as criterion, $b_O$ has to be preferred, since $b_{OR}$, in contrast to $b_O$,
has no finite moments. But to compare the estimates one can also use the
probability that these estimators fall into an interval of fixed length which
contains the true parameter. This problem will be treated more precisely in
Sections 3.5.2 and 3.5.3.

For practical purposes another representation of the ORLSE for bivariate
LIFU is useful. Namely, if

$$ \varphi := \tan^{-1} \beta, \tag{3} $$

then

$$ \tan(2\hat\varphi_{OR})/2 = d_{xy}/(d_x - d_y) \tag{4} $$

holds (cf. e.g. Malinvaud, 1956).
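
The angle representation (4) and the slope formula (2) describe the same line, which can be confirmed numerically. In the sketch below the branch of the angle is selected with `atan2`, so that the resulting direction is the one of largest scatter; this choice of branch is an assumption made here, since (4) by itself determines $\varphi$ only up to multiples of $\pi/2$. The moment values are made up.

```python
import math


def orlse_angle(dx, dy, dxy):
    """Angle phi with tan(2*phi) = 2*d_xy / (d_x - d_y), major-axis branch."""
    return 0.5 * math.atan2(2 * dxy, dx - dy)


def orlse_slope(dx, dy, dxy):
    """Slope from formula (2), case d_xy != 0."""
    return ((dy - dx) + math.sqrt((dy - dx) ** 2 + 4 * dxy ** 2)) / (2 * dxy)


# Made-up second moments.
dx, dy, dxy = 5.0, 2.0, 3.0

phi = orlse_angle(dx, dy, dxy)
slope_from_angle = math.tan(phi)
slope_from_formula = orlse_slope(dx, dy, dxy)
half_tan_double_angle = math.tan(2 * phi) / 2
```

With `atan2` the degenerate cases of (2) are also covered: for $d_{xy} = 0$ and $d_x > d_y$ the angle is 0, and for $d_{xy} = 0$ and $d_x < d_y$ it is $\pi/2$, i.e. a vertical line.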

Starting from this, the ORLSE can be modified by minimizing not the sum of
the orthogonal distances but a weighted sum. That is, let $m_i$ be the weights for
the observations $z_i$; then the weighted ORLSE $b_W$ (cf. Ware, 1972) is given by

$$ \tan(2\hat\varphi_W)/2 = \frac{\sum_i m_i (x_i - \bar x_\cdot)(y_i - \bar y_\cdot)}{\sum_i m_i (x_i - \bar x_\cdot)^2 - \sum_i m_i (y_i - \bar y_\cdot)^2}. \tag{5} $$

The essence of this construction corresponds to the step from OLSE to GLSE
in the regression model.


If the weightings come from replicated observations, then we have

$$ \tan(2\hat\varphi_W)/2 = b_{xy}/(b_x - b_y). \tag{6} $$


These estimates remain consistent and asymptotically normally distributed
under the assumptions explained in detail in Section 3.4.


3.3.1.3 Instrumental variables


The following grouping estimator due to Wald (1940) is still more simply defined
than the OLSE and geometrically as clear as the ORLSE. It is an advantage
of this estimator to be simple as well as equivariant. Namely, the observations
are split into two groups, $z_{ij}$, $i = 1, 2$; $j = 1, \dots, m/2$ for even $m$, and the
line is drawn through the means of both halves of the observations (cf. Figure 3.3.1).
Thus let

$$ b_{\mathrm{Wald}} = (\bar y_{2\cdot} - \bar y_{1\cdot})/(\bar x_{2\cdot} - \bar x_{1\cdot}). \tag{7} $$

Bartlett (1949) showed that for equidistant $\xi_i$ observed without error, a smaller
variance can be obtained by omitting the middle third of the observations.
The condition for consistency (Wald, 1940) is

$$ \liminf_{n \to \infty} |\bar\xi_{2\cdot} - \bar\xi_{1\cdot}| > 0. \tag{8} $$


Fig. 3.3.1 


This condition is obvious: the corresponding sample means of the experimental
design points have to differ between the groups. If the $\xi_{ij}$ are randomly
distributed, the expectations of their generating distributions have to be different
for $\xi_{1j}$ and $\xi_{2j}$. All these estimators are a special case of

$$ b_{IV} = \sum_i u_i y_i \Big/ \sum_i u_i x_i. \tag{9} $$

In Wald’s estimate we have $u_i = \pm 1$ according to whether $z_i$ is from the
first or the second group.

Now the variables $u_i$ need not come from a grouping criterion which is
independent of the observations. They may also be further variables which
act in the system described by the LIFU but which do not occur in the LIFU
itself and which are nevertheless connected with the unknown experimental design.
That means that these variables are correlated with the unknown
experimental design.
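
The relation between Wald's grouping estimator (7) and the general form (9) can be made concrete with a few lines of code. The sketch assumes that the observations are already ordered so that the first half forms group 1 and the second half group 2, and that the instruments sum to zero, so that an intercept drops out of (9); both are assumptions of this illustration, and the data are made up.

```python
def iv_slope(u, x, y):
    """General instrumental-variable form (9): sum(u*y) / sum(u*x)."""
    return sum(ui * yi for ui, yi in zip(u, y)) / sum(ui * xi for ui, xi in zip(u, x))


def wald_slope(x, y):
    """Wald's grouping estimator (7): line through the two half-sample means."""
    n = len(x)
    if n % 2:
        raise ValueError("an even number of observations is assumed")
    h = n // 2
    x1, x2 = sum(x[:h]) / h, sum(x[h:]) / h
    y1, y2 = sum(y[:h]) / h, sum(y[h:]) / h
    return (y2 - y1) / (x2 - x1)


# Made-up observations, first half = group 1, second half = group 2.
x = [0.1, 0.9, 2.2, 2.8]
y = [1.2, 2.8, 5.1, 7.1]

# Group indicator as instrument: -1 for group 1, +1 for group 2.
u = [-1, -1, 1, 1]

b_wald = wald_slope(x, y)
b_iv = iv_slope(u, x, y)
```

Both computations return the same slope, since with $u_i = \pm 1$ the sums in (9) are proportional to the differences of the group means. Replacing `u` by any other variable that is related to the design but not to the errors gives a different member of the same family.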

Let the empirical correlation be different from zero:

$$ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} u_i \xi_i \neq 0. \tag{10} $$

On the other hand, these so-called instrumental variables (IV) $u_i$ are to be
independent of the observational errors.
That means that, almost surely,
1 n 

n—>oo 1 j=1 
Then on the assumption of (11) one can easily show that (10) is necessary
and sufficient for strong consistency of the instrumental variables estimate
(IVE) $b_{IV}$. (Wald’s condition (8) is just equivalent to (10) for two groups.)
The question whether a variable $u$ is suitable as an IV has to be
decided for each estimation problem separately. In applications from the human
and social sciences such variables are used more frequently than in the natural or
industrial sciences. Nevertheless, instrumental variables can sometimes be
obtained in the latter. An example is the observation of a moving object
at discrete time points. Then the ranks $r_i$ of the $\xi_i$ are known as instrumental
variables and, for instance, any of the estimates


$$ b_{ij} = (y_{r_j} - y_{r_i})/(x_{r_j} - x_{r_i}), \quad i < j, \tag{12} $$

can be used, or the mean of the $b_{ij}$ or their median. This problem was investigated
by Ware (1972). In this model he proved asymptotic normality for some IVE.
This is treated in Section 3.4 in a more general context. The IVE is consistent,
but with finite samples there is the disadvantage that every IV necessarily
enlarges the variance. With variables measured without error the OLSE is to be
preferred anyhow. For that reason Feldstein (1974) proposed to combine both
estimates by

$$ b_F = \alpha b_O + (1 - \alpha) b_{IV}, \quad 0 \le \alpha \le 1, \tag{13} $$


where the MSE is to be minimized over «. Moreover, a pretest estimate is given 
there: 


Bos if Qed . 
Kal (Gree BY am a 

where 
$$ Q := \mathrm{MSE}(b_O)/\mathrm{MSE}(b_{IV}). \tag{15} $$


Of course, the question arises whether a ‘more natural’ correction of the OLSE
could provide a consistent estimate. This is possible if there is an estimate of
the error covariance matrix which is independent of the errors of the
observations. For instance, $w_x/(m - n)$ is a consistent unbiased estimate of
$D(\xi) + D(\delta)$, where the expression

D(§) = lim = Sz < 00 (16) 


n—>co 1 
is defined both for non-random and for random §;. Then 
B = (wzy|(m — m) — 65.)/(wz](m — m) — 65) (17) 


is a consistent estimate. 
3.3.1.4 Use of variance components


The modification given in (17) is essentially based on a consideration of the
expectations of the corresponding sums of squares. This procedure was
introduced by Tukey (1951). For instance, it is easy to see that under replications
the estimators

$$ b = b_{xy}/(b_x - w_x) \tag{18} $$

and

$$ b = (b_y - w_y)/b_{xy} \tag{19} $$

are consistent. Further estimates of this kind are to be found in the survey
of Madansky (1959) (see also Dorff and Gurland, 1961a, b).

Besides the estimators stated up to now, there exist further ones for bivariate
LIFU which are less frequently applied. Among them are the estimators developed by
Neyman, Scott, Kiefer, and Wolfowitz, as well as the method of cumulants and
others (cf. Section 3.1.6).


3.3.1.5 Limited-information maximum likelihood 
and two-stage least squares estimators 


The consistency obtained by modifying the OLSE suggests the question
whether modifications of the MLE also yield better estimates. We can start
from equation (3.2.119). For $q = 1$, a modification is obtained if the smallest
eigenvalue of $S_{Z.W}^{-1} Q_{Z.U.V}$ is replaced by an arbitrary fixed or random
constant $\lambda \in \mathbb{R}^1$:

$$ b = (Q_{X.U.V} - \lambda S_{X.W})^{-1} (Q_{XY.U.V} - \lambda S_{XY.W}). \tag{20} $$

If $\lambda = \lambda_1(S_{Z.W}^{-1} Q_{Z.U.V})$, the MLE is obtained, which in econometrics is also
known as the MLE under limited information (LIML). For $\lambda = 0$ the
2SLS-estimate is obtained. Finally, according to (3.2.107), for $\lambda = 1$ there results the
estimate $S_{X.V}^{-1} S_{XY.V}$, which in the case $V = 0$ becomes the OLSE in the LIFU*
if it is considered as a linear multivariate regression model. Because it is
important in the literature concerning linear simultaneous equations, the original
construction of the 2SLS-estimate and the LIML will be explained here. Thus the
relations between linear simultaneous equations and LIFU* are examined
from another aspect. For the present we consider the case $V = 0$, $U \neq I_m$
(cf. (3.1.37)). Then $M$ can be estimated by the OLSE $\hat M$ with $U$ as regressor
in the first step. For $\Omega = I \otimes \Sigma$ we have


~ 


W2 TP. VORM SMe ODM = (GU)2A@z (21) 
(cf. Bunke and Bunke, 1986, (251515), with Ci CX XA A) ue for 
explicit LIFU we have 


AYE = [é:, Be tae 


M=(Xj%],  XEMooxn (22) 


Y = BX +3 
Dé = (UU’)1 @ (—B | I,) S{|—B'l,] =: A @ 2;: 


In the second step the 2SLS-estimate is obtained if the LIFU* (22) is inter- 
preted as regression model. In this model we have U = I,, corresponding to 
(3.1.37). With that the 2SLS-estimator is 


B= (KA2X') FAY’ = OF yOQxv.v (23) 


and for g = 1 there results the shape of the 2SLS-estimator already derived 
from (20). Now (23) is defined also for g > 1, but then it can not be brought 
into relation to other eigenvalue estimators. 

The LIML, too, was originally constructed starting from (22). For this 
a two-stage WLSE was used in (22), in which a criterion with estimated co- 
variance 2, was taken instead of the original least squares criterion belonging 
to model (22): 


te [27 (¥ — BX) AY BX)] (24) 


The corresponding estimator 2; of X; is based on one of Z. Starting from (21) 
we choose 


pes Dz_ su = Szy (25) 


as an estimator of 2. 
(Then 5/(m — n) is an unbiased estimator of 2.) 
Then, with the help of 


435 ee T,) 2[—B' + 1, (26) 
the LIML is defined as 
min tr [((—B}J,) S| —B’ t Iq) (¥ — BX) A-U¥ — BX)’. (27) 
- 
Because of MA-1M = ZPyZ' = Qz.v, this is equivalent to 
min [{tr PZ254Qz.y27""}] 
B 


with = R([SY?—B’ : I,]). The problem was already treated for 
ft = R([—B' ; I,]) = RL") 
in Section 3.2.3. Thus the matrix $L^{\perp}$ is obtained from the eigenvectors belonging
to the $q$ smallest eigenvalues of $S_{Z.U}^{-1} Q_{Z.U}$. With the transformation described
in Example 3.2.2, Section 3.2.5, one obtains $\hat B$. According to (3.2.119) this is
just the estimate constructed as LIML, in which $\hat L^{\perp}$ and $\hat B$ are connected
according to (3.2.116). But take into consideration that for $U = I_m$, $V = 0$, the MLS
arises for LIFU* without replications. The MLS provides only a saddle-point
of the likelihood function, which is unbounded (cf. Section 3.2.1.3). As above,
the case $V \neq 0$ is included in a natural way: $Q_{Z.U}$ is replaced by $Q_{Z.U.V}$ and
$S_{Z.U}$ by $S_{Z.W}$. Starting from (22), Theil (1958) constructed the k-class estimators.
Those we obtain from (20) for fixed $\lambda$ and $k = 1 - \lambda$. For $q = 1$ the known
representations from the field of econometrics result if in $S_{X.W}$, etc., the
corresponding projectors are written. For instance, the 2SLS-estimator is obtained
from


bos = (XP,X')-1 (XB,Y’), 
Where. Pra J — Py, — (1 — 1) (L — Py): 


Under the assumption $U'V = 0$, which is possible without loss of generality
according to (3.2.81), it results that $P_W - P_V = P_U$. Hence the representation
commonly used in econometrics is obtained (cf. Farebrother, 1976).

However, even the so defined k-class estimator can not be applied without 
problems. Their most important representatives MLE and 2SLS-estimator 
have no finite moments of higher order (Mariano and Sawa, 1972, 4). The 
same holds for all k-class estimators with fixed constant k (Sawa, 1972). Thus 
the variances of these estimators also do not exist. Consequently, their accuracy 
can not be compared on the basis of their variance, which is a simple measure 
of concentration. Of course, there are justified arguments against comparing 
only estimates with existing variance (Anderson, 1976, p. 8). 

For instance, the probability of falling into an interval around the true 
parameter might also be chosen as a very informative concentration measure. 
An enlarged theory in this direction was developed for models with a finite 
number of parameters (cf. Weiss and Wolfowitz, 1977). For LIFU* where the 
number of parameters increases, asymptotic results were obtained for the 
mentioned estimates (cf. Section 3.5.2). 
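
For $q = 1$ and $p = 2$ the family (20) can be written out completely, which may help to see how the 2SLS-estimate and the LIML arise from one formula. The sketch below is an illustration only: the $2 \times 2$ matrices `S` and `Q` stand in for $S_{Z.W}$ and $Q_{Z.U.V}$, whose construction from data is defined earlier in the book and is not reproduced here, and their entries are made-up numbers.

```python
import math


def smallest_eigenvalue(S, Q):
    """Smallest root lam of det(Q - lam*S) = 0 for symmetric 2x2 S > 0 and Q."""
    (sx, sxy), (_, sy) = S
    (qx, qxy), (_, qy) = Q
    a = sx * sy - sxy ** 2
    b = -(qx * sy + qy * sx - 2 * qxy * sxy)
    c = qx * qy - qxy ** 2
    disc = math.sqrt(b * b - 4 * a * c)
    return (-b - disc) / (2 * a)


def k_class_slope(S, Q, lam):
    """Scalar version of (20): (Q_xy - lam*S_xy) / (Q_x - lam*S_x)."""
    return (Q[0][1] - lam * S[0][1]) / (Q[0][0] - lam * S[0][0])


# Hypothetical stand-ins for S_{Z.W} and Q_{Z.U.V}, ordered as (x, y).
S = [[2.0, 0.5], [0.5, 1.0]]
Q = [[6.0, 3.0], [3.0, 4.0]]

lam1 = smallest_eigenvalue(S, Q)
b_liml = k_class_slope(S, Q, lam1)   # lam = smallest eigenvalue
b_2sls = k_class_slope(S, Q, 0.0)    # lam = 0

# At an eigenvalue both rows of (Q - lam*S) give the same slope.
b_from_second_row = (Q[1][1] - lam1 * S[1][1]) / (Q[1][0] - lam1 * S[1][0])
```

The last line shows what is special about the eigenvalue choice: only there is $Q - \lambda S$ singular, so that the slope read from the first row agrees with the slope read from the second row. For any other fixed $\lambda$ the two rows disagree and (20) uses the first one by convention.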


3.3.1.6 Modified maximum likelihood estimation 


Of course, one also wants to have simple estimates available with finite variance. 
For that reason Fuller (1977) investigated modifications of the MLE and 2SLS 
estimator which have moments of higher order for sufficiently great sample size. 

It is clear that the moments of the estimators defined in (20) do not exist
if the inverse of the random matrix $Q_{X.U.V} - \lambda S_{X.W}$ becomes ‘too often too
great’, that is, if the matrix itself becomes ‘too often too small’. This happens
if $\lambda$ is ‘too often too great’. Thus a perturbation of $\lambda$ towards smaller values
could secure the existence of the moments. In fact this holds for the following
modifications: the MLE becomes a modified MLE,


bum a (Qx.vv — ASx.w)? (Qxv.0.v — ASxy.w)s (28) 
A= 4,(SzwQz.0.v) — alm(m —n), «> 0. 
The modified 2SLS estimator is 
bas = S¢/Sxv (29) 
with 
Sx =Qx.0v —AS8x0,  Sxy = Qxv.u.v — ASx.w (30) 


and with g := 1 and 4, = 4,(Sx,wQy.v.v); 


(iyi = = e2 
n je ge et fo pieces 4 Bist Gea 
7 m(m — n) m(m — n) 
| A ie Matai otherwise. 
r m(m — n) 


Both these modifications have the same bias up to order O(m-?) (Fuller,
1977).

Such modifications are formally defined also for $q > 1$. Now, for the MLE
of $B$ it has to be taken into consideration that its explicit representation in
the form (20) is possible only for $q = 1$ (cf. Example 3.2.2 in Section 3.2.5).
The substance of the modification in (28) consists in the perturbation of $S_{Z.W}$
towards smaller values, whereas the modification of the 2SLS estimator arose
directly from (20). The comparison of both modifications is facilitated
by their asymptotically equivalent bias. With that, for $q > 1$, two different
possibilities of generalization are obtained.


A modified MLE for #* results as the eigenspace to the q smallest eigenvalues 
of 


S7w(Qz.u.v + AS, Wwe 


where $\lambda$ is an arbitrary positive constant. On the other hand, a modified k-class
estimator can be obtained for $L = R([I : B']')$ from (30) if $\lambda$ is an arbitrary fixed
or random number $\lambda = \lambda(z_{(m)})$. From (30), for $q = 1$ the MLE results for
$\lambda = \lambda_1(S_{Z.W}^{-1} Q_{Z.U.V})$, whereas this is not the case for $q > 1$. With the help of
another approach it will be shown in Section 3.4 that for $q > 1$, too, a natural
connection can be established between MLE, 2SLS, and other estimators.


3.3.2 Linear functional relations with dependent errors


Dependent errors occur most frequently in econometric time-series models,
and less often in problems of so-called data-fitting in the natural and industrial
sciences. But the treatment of such dynamic models is beyond the scope of this
introduction. On the other hand, with ‘pure’ data-fitting problems there will be
situations from time to time where dependences among the observations cannot
be excluded. Such dependences may arise from the experimental design
as well as from the observations. These models need not include such ‘strict’
and explicit dependences as in time-series models from the outset. For that
reason it is desirable to have a practicable consistent estimator available for
such problems, in which less is known about the mechanism generating the
experimental design and about the process of observation. Of course, this method
could also be applied to time-series models. Such an estimator was constructed
by Robinson (1977). In particular, it includes regression models with independent
errors. For the present a heuristic introduction of this estimator is given.

We start from a LIFU where the stochastic quantities $\xi_i$, $\delta_i$, $\varepsilon_i$ are
altogether independent, identically distributed in each case, and have vanishing
expectation. If one constructs the estimator as usual on the basis of the
second sample moments, that is from $S_Z$, then in the limit as $n \to \infty$ one
has the covariance matrix of $Z$ available to construct the consistent estimator
(cf. also Section 3.2.1):

$$ DZ = \begin{pmatrix} D(\xi) + D(\delta) & D(\xi) B' \\ B D(\xi) & D(\varepsilon) + B D(\xi) B' \end{pmatrix}. \tag{31} $$

(For fixed experimental design one must demand the existence of the limit
$S_Z \to DZ = \Sigma_Z < \infty$. This demand is obvious, since otherwise the experimental
design would be too dispersed. In that case no consistent estimator would be
possible on the basis of exclusively using the second sample moments $S_Z$, since
the convergence of $S_Z$ would be necessary.)

But now, with the knowledge of the $p(p + 1)/2$ elements of $DZ$, in the limit
it would not be possible to determine the elements of $B$, $D\xi$, $D\delta$, $D\varepsilon$, since the
total number of parameters, $q(p - q) + (p - q)(p - q + 1) + q(q + 1)/2$,
exceeds this number of known second sample moments by $(p - q)(p - q + 1)/2$.
This indeterminacy was already reflected by the identifiability statements of
Theorem 3.1.1. Therefore in the model one has to assume additional
information just corresponding to the amount of information that is lacking.
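
The counting argument can be checked mechanically. The following sketch simply tabulates the number of distinct second moments and the number of free parameters of $B$, $D\xi$, $D\delta$ and $D\varepsilon$ for given $p$ and $q$; the function name is chosen here for illustration.

```python
def parameter_counts(p, q):
    """Known second moments, unknown parameters, and their difference."""
    r = p - q                                  # number of independent variables
    known = p * (p + 1) // 2                   # distinct elements of DZ
    unknown = (q * r                           # elements of B
               + r * (r + 1) // 2              # D(xi), symmetric r x r
               + r * (r + 1) // 2              # D(delta), symmetric r x r
               + q * (q + 1) // 2)             # D(epsilon), symmetric q x q
    return known, unknown, unknown - known


# Bivariate case p = 2, q = 1: three moments, four parameters.
bivariate = parameter_counts(2, 1)

# The deficit equals (p - q)(p - q + 1)/2 for every admissible pair (p, q).
deficits_match = all(
    parameter_counts(p, q)[2] == (p - q) * (p - q + 1) // 2
    for p in range(2, 8) for q in range(1, p)
)
```

In the bivariate case exactly one additional piece of information is needed, which corresponds to the classical assumption of a known ratio of the error variances.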

For this, different possibilities have been discussed already: the utilization
of information from higher-order moments, instrumental variables, replicated
observations, etc. These methods start from a complete indeterminacy of all
parameters. But there are cases in which sufficiently many parameters are
exactly known; for example, the model with known variance quotients in
LIFU is a classical one. In many practical problems it can be assumed that
$D\delta$ and $D\varepsilon$ are diagonal. This is true for many technical curve-fitting problems
where the measurement errors of the single variables are independent. Or
perhaps it can be assumed that the $k$th component of $\eta$ is not influenced by the
$l$th component of $\xi$, so that $b_{kl} = 0$ for some $k$, $l$. Finally, it could be known that
several variables are measured with the same instrument, in which case the
variance of the measurement errors would be taken as constant and the
corresponding variances $D\delta^{(i)}$ would be equal.

Now one has to check what this additional information implies for
identifiability. For the present, from $\Sigma_Z$, due to (31), under fixed $D\delta$ one can always
determine $D\xi$. Here an equation is $D\xi = Dx - D\delta$. Further conditions might
be given by an implicit equation

$$ 0 = h(\psi), \quad \psi = [B, D\delta, D\varepsilon] \tag{32} $$
(cf. Section 3.5.3) where in 
fi Ra (pg) g == Gp = 9)*, 8h (33) 


the trivial conditions of symmetry on $\Sigma_\delta$ and $\Sigma_\varepsilon$ are to be included. Beyond that,
at least $(p-q)(p-q+1)/2$ additional conditions are necessary. (But recall
that the demand $D\delta, D\varepsilon \ge 0$ implies the known inequalities for certain
subdeterminants.)

$h$ and $S_z$ may contain more information than is necessary for the
consistent estimation of the parameters. Such an estimation problem can be solved by
data fitting, that is, by solving a corresponding maximization problem.
For this, functions similar to likelihood functions suggest themselves for
several reasons. One starts from a LIFU


$y_i = Bx_i + (\varepsilon_i - B\delta_i)$ (34)


formally written as a regression model. The inconsistency of the OLSE (cf. Section
3.1.2) was in essence based on the fact that the regressor $x_i$ and the error
$\varepsilon_i - B\delta_i$ are not independent. If one could succeed in generating, by a
parameter transformation, a regression model in which the regressors and the errors are
uncorrelated at least asymptotically, one would expect the corresponding
MLE from the regression model to be consistent. For this, put $\bar B = B - \tilde B$.
Then


$y_i = \bar B x_i + \bar\varepsilon_i$, $\bar\varepsilon_i = \varepsilon_i - B\delta_i + \tilde B x_i$. (35)
Therefore, with the abbreviations $S_{x\bar\varepsilon} := S_{x,\bar\varepsilon}$, etc., we have

$S_{x\bar\varepsilon} = S_{x,\varepsilon - B\delta} + S_x\tilde B'$

$= S_{\xi\varepsilon} + S_{\delta\varepsilon} - S_{\xi\delta}B' - S_\delta B' + S_x\tilde B'$. (36)

In this equation the first three terms vanish asymptotically. If we put
$\tilde B' = S_x^{-1} D(\delta) B'$, then this also holds for the last difference, and the desired form
of a regression model results. Hence we form the regression model


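$y_i = \bar B x_i + \bar\varepsilon_i$,
$\bar B = B - \tilde B$, $\tilde B = B\,D\delta\,S_x^{-1}$. (37)


Because of the independence of the $\xi_i$, $\delta_i$, $\varepsilon_i$, we have
$D\bar\varepsilon_i = D\varepsilon + (B - \tilde B)\,D\delta\,(B - \tilde B)' + \tilde B\,D\xi\,\tilde B'$
$= D\varepsilon + B\,D\delta\,B' - 2B\,D\delta\,S_x^{-1}D\delta\,B'$
$+ B\,D\delta\,S_x^{-1}\Sigma_x S_x^{-1}D\delta\,B'$. (38)
If $Dx = \Sigma_x$ is approximated by $S_x$, then for $D(\bar\varepsilon_i)$ one obtains approximately
$\bar\Sigma = D\varepsilon + B(D\delta - D(\delta)\,S_x^{-1}D(\delta))B'$. (39)
We put
$\psi = (B, D\delta, D\varepsilon)$. (40)


One obtains as quasi-likelihood function for the regression model (37),
$l_n(\psi) = -\log\det[\bar\Sigma] - \operatorname{tr}\bar\Sigma^{-1}S_{\bar\varepsilon}$. (41)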


Robinson (1977) showed that $-l_n$ is a contrast function of the parameters.
Minimum contrast estimates are known to be consistent (e.g. Strasser,
1973) and, under additional assumptions, asymptotically normal. The essence of
the procedure developed by Robinson (1977) consists in showing the
convergence of $S_z$ toward the covariance given in (31) under weaker conditions than
simple independence of the $\xi_i$, $\delta_i$, $\varepsilon_i$, and then in utilizing the properties
of the function $l_n(\psi)$.
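
The effect of the transformation $\bar B = B - \tilde B$ with $\tilde B = B\,D\delta\,S_x^{-1}$ can be illustrated numerically in the scalar case. The sketch below is only an illustration under stated assumptions: normally distributed $\xi_i$, $\delta_i$, $\varepsilon_i$ with mean zero, a known error variance $D\delta$, and hypothetical names (`simulate`, `second_moments`) and parameter values that do not come from the text.

```python
import random


def simulate(n=20000, b=2.0, var_xi=1.0, var_delta=1.0, var_eps=0.25, seed=0):
    """Draw (x_i, y_i) from a scalar explicit LIFU with normal components."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        xi = rng.gauss(0.0, var_xi ** 0.5)
        xs.append(xi + rng.gauss(0.0, var_delta ** 0.5))    # x = xi + delta
        ys.append(b * xi + rng.gauss(0.0, var_eps ** 0.5))  # y = b*xi + eps
    return xs, ys


def second_moments(xs, ys):
    """Uncentered second sample moments S_x and S_xy."""
    n = len(xs)
    s_x = sum(x * x for x in xs) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return s_x, s_xy


xs, ys = simulate()
s_x, s_xy = second_moments(xs, ys)
var_delta = 1.0  # assumed known, as in the models with known error variances

# OLS slope: estimates B_bar = B - B_tilde, not B itself (attenuation).
b_bar = s_xy / s_x
# Undo the shift B_tilde = B * D(delta) / S_x to recover B.
b_corrected = b_bar / (1.0 - var_delta / s_x)

print("OLS slope:", b_bar)
print("corrected slope:", b_corrected)
```

With these settings the OLS slope settles near $b\,D\xi/(D\xi + D\delta) = 1$, while the corrected value is close to the true $b = 2$.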

Now we give the construction of the estimator in a comprehensive form. 
The proof of consistency, some questions connected with asymptotic normality, 
and further explanations will follow in Section 3.5. 


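By $\mathfrak{B}_n$ we denote the sequence of random variables $\delta_k$, $\varepsilon_k$, $\xi_k$, $k < n$, and by
$\mathfrak{B}_{\delta n}$, $\mathfrak{B}_{\varepsilon n}$ those sequences which in addition contain $(\varepsilon_n, \xi_n)$ and $(\delta_n, \xi_n)$, respectively.
The model assumptions used until now will be weakened for the explicit LIFU:


$\eta_i = B\xi_i$, $x_i = \xi_i + \delta_i$, $y_i = \eta_i + \varepsilon_i$. (42)
Assumption A1 Let, for $n > 2$, almost surely
$E(\delta_n \mid \mathfrak{B}_{\delta n}) = 0$, $E(\varepsilon_n \mid \mathfrak{B}_{\varepsilon n}) = 0$, (43)
$E(\delta_n\delta_n' \mid \mathfrak{B}_{\delta n}) = \Sigma_\delta < \infty$, (44)
$E(\varepsilon_n\varepsilon_n' \mid \mathfrak{B}_{\varepsilon n}) = \Sigma_\varepsilon < \infty$, (45)


where the $T_n$ may be nonrandom or random (dependent) matrices with the following
convergence property:


$\frac{1}{n}\sum_{i=1}^{n} T_i \to T < \infty$. (47)
(Remark: $T$ need not be regular!)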


Assumption A2 For some $c > 2$ there is a constant $K < \infty$ such that for all
$n \ge 1$:
$E\|\xi_n\|^c,\ E\|\delta_n\|^c,\ E\|\varepsilon_n\|^c < K$. (48)


Assumption A3 Let $\Sigma_x := \Sigma_\delta + T$ be regular. (We note in particular that
the case $0 = \Sigma_\delta = D\delta$ is possible, which gives the usual linear regression model
with correlated errors.) We put


$\psi := (B, \Sigma_\delta, \Sigma_\varepsilon) \in \mathbb{R}^s$ (49)


for $B \in M_{q\times(p-q)}$, $\Sigma_\delta \in M_{(p-q)\times(p-q)}$, $\Sigma_\varepsilon \in M_{q\times q}$. Let $h$ be a (sufficiently often)
continuously differentiable function of $\psi$:


Wiery tere Cee Peers (50) 
For fixed $T$ we put $\Sigma_x := \Sigma_\delta + T$

Ag Rises a Aa (51) 

$\Omega = \Omega(\psi) = B(\Sigma_\delta - \Sigma_\delta\Sigma_x^{-1}\Sigma_\delta)B' + \Sigma_\varepsilon$, $\Omega_0 = \Omega(\psi_0)$. (52)
Assumption A4 Let $\Psi$ be a compact set of parameters $\psi \in \mathbb{R}^s$ for which

$h(\psi) = 0$, (53)

$\Omega(\psi) > 0$ (54)


holds, and let the 'true' parameter $\psi_0 = (B_0, \Sigma_{\delta 0}, \Sigma_{\varepsilon 0})$ be contained in $\Psi$.
Furthermore, let


$S(\psi) = S_{y-Ax} = S_y - AS_{xy} - S_{yx}A' + AS_xA'$, (55)
$A = A(\psi) = B(I - \Sigma_\delta S_x^{-1})$, (56)
$\Omega = \Omega(\psi) = B(\Sigma_\delta - \Sigma_\delta S_x^{-1}\Sigma_\delta)B' + \Sigma_\varepsilon$, (57)


and
$l_n(\psi) = -\log\det[\Omega] - \operatorname{tr}\Omega^{-1}S(\psi)$. (58)


Robinson (1974) defined as the estimator of the parameter that value $\hat\psi$ in $\Psi$
which maximizes $l_n(\psi)$.
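
To see how the quasi-likelihood singles out the true parameter, one can evaluate it in the scalar case ($p = 2$, $q = 1$) with the limiting second moments in place of the sample moments. This is a sketch under assumptions not made in the text: the restriction $h$ is taken to be "$\Sigma_\delta$ known", the compact parameter set is replaced by a finite grid, and all numerical values and the name `quasi_loglik` are hypothetical.

```python
import math


def quasi_loglik(b, var_eps, s_x, s_xy, s_y, var_delta):
    """Scalar version of l_n(psi) = -log det(Omega) - tr(Omega^{-1} S(psi))."""
    a = b * (1.0 - var_delta / s_x)                                # A(psi)
    omega = b * b * (var_delta - var_delta ** 2 / s_x) + var_eps   # Omega(psi)
    s_psi = s_y - 2.0 * a * s_xy + a * a * s_x                     # S(psi)
    return -math.log(omega) - s_psi / omega


# Limiting second moments for b0 = 2, D(xi) = 1, D(delta) = 0.5, D(eps) = 0.25.
b0, var_xi, var_delta, var_eps0 = 2.0, 1.0, 0.5, 0.25
s_x = var_xi + var_delta
s_xy = b0 * var_xi
s_y = b0 * b0 * var_xi + var_eps0

# Finite grid standing in for the compact parameter set (illustrative choice).
grid = [(i * 0.05, j * 0.05) for i in range(0, 81) for j in range(1, 21)]
b_hat, var_eps_hat = max(
    grid, key=lambda t: quasi_loglik(t[0], t[1], s_x, s_xy, s_y, var_delta)
)
print("maximizer on the grid:", b_hat, var_eps_hat)
```

On this grid the maximum is attained at the true values $b_0 = 2$ and $D\varepsilon = 0.25$, because at the truth $\Omega$ and $S(\psi)$ coincide and $S(\psi)$ is smallest there.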


3.3.3 Nonlinear models with independent errors 
3.3.3.1 Modified least squares estimation 


As with linear models, there are also some possibilities for modifying the
available estimators for nonlinear models. Here one has to consider that the complete
solution of the nonlinear equation systems developed in Section 3.2 for computing
the MLE and WLSE may be very difficult, since the dimension increases
proportionally to the number of observations. That is why one needs algorithms
which are easier to compute, even for sample sizes that are considered
small in practical problems. Such alternatives are based on linearizations of
the original problem and on using the first iterations for the WLSE,
starting from an initial estimate. General iteration procedures will be treated in
more detail in Section 3.8.3. In the present section a modified Gauss-Newton
procedure is discussed in detail. Starting from a consistent initial
estimate $\pi_1$ for $\pi$, one obtains asymptotic normality, also under an increasing
experimental design (cf. Section 3.5.4), with the first iteration $\pi_2$. This
modification of the Gauss-Newton procedure was developed by Fuller and Wolter
(1982) for d, = 1 and is based on the paper of Villegas (1969) in which replicated 
observations of a fixed experimental design are treated. We consider the 
explicit sequence model for m = 1, 2,..., 


$\eta_{i0} = r_i(\xi_{i0}; \pi_0)$, $i = 1, \dots, n_m$, (59)
$x_{im} = \xi_{i0} + \delta_{im0}$, $D\varsigma_{im} = \Sigma_m$, (60)
$y_{im} = \eta_{i0} + \varepsilon_{im0}$. (61)


We putd, = q, d, = p. The size m, of the experimental design increases with m: 
MN = Nala (62) 


where n,, = n(m), lm = Um). 
Assume that is known. Let 


has be) > OF Lt = kp, = 0(m-12), (63) 
Then one obtains the relation: 
n= m-k =o(mi?), (64) 


Then, because of $\Sigma_m \to 0$, it also holds that $z_{im} \to \mu_i$. This model contains the
case of independent replications of a fixed experimental design with known
covariance D$ = Z. Here n is fixed and for 1 = 1,...,n; mj = ky! = Imp, it 
holds that 

km = nlm = O(m-) = o(m-¥?2) (65) 
and 

Zim = i» Doin aa Zl km = mins 
The procedure assumes an initial estimator $\pi_1$ of $\pi_0$ with

$\pi_1 - \pi_0 = O_p(n^{-1/2})$. (66)


Under weak assumptions such an estimate
is the OLSE in the nonlinear regression model


Yin = TA Lem» It) + Stn 
(cf. Section 3.5), where the $x_{im}$ are used instead of the $\xi_{i0}$. In the case of
replications one would have $y_{im} = \bar y_{i\cdot}$, $x_{im} = \bar x_{i\cdot}$.
Starting from $\pi_1$ one constructs an iteration $\pi_2$ similar to the Gauss-Newton
one. For this iteration asymptotic normality can be proved.

With 2, an estimator ,), of the experimental design p;,), which lies on 

Sy,z 18 obtained by 

Le(M(n)1) = |l2(n) — Mnyrllo-2 = min |[%n) — Mmllo-a- (67) 

HE wry 


The estimator ,), is one of the stationary points of 1, which, because of 
Q=1©®& 2, result from the equations 


0 = Gl, = Ord (yi — ra) + 2°(yi — ri) 
iP Ord” (x; — &) + 2%"; — &,) (68) 
where 2,,! was partitioned into the blocks 2° etc., and 0,74, := O¢7;(&i,, ™). 


From the solutions of (68) one chooses those which minimize (67). With
the Taylor-series expansion


i= Ta + Orin(% — mm) + Orn(E; — fn) + BR (69) 
we obtain the following approximation of [,(u) with C(n); = 2n) — Mnyjs 


An; = x —7;,j = 0, 1 Cer AR pe 


L(, 0) = lle(ny — (ayn — (4 — Hnyi)llo— 

= UAE ays, Am). (70) 
Therefore the WLSE &,% is to be obtained approximately by the following 
relation: j 

1,(%, E(m)) = min 1,(2, E(m) © U(Aym, 49é) := min 1(Am, Agyi)- (71) 


%,P(n) Am, 48 ()1 
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Then the second iteration 2, =: gy (GN for Gauss-Newton) for the WLSE # 
results from 


Ne = 1 + Aan. (72) 


The point ¢ny2 := ([Ei2. Ni2])i=1,...n = ([E:e, ri(Fie, 72)])i-1,....n is contained on 
Sy,x,- But with repeated application of (67) a ‘better’ estimate of the experi- 
mental design is obtained. The solution of (71) is obtained as in the linear 
regression model. With 0 = O,y_,),s,for fixed Am, the minimum over A€;, is 
the solution of 


pare Os Orin] A2é(47%) = Py 2 V2Cy, = [0} Oni] Am) (73) 
with 

£5 = KEV [L pg, Ori). (74) 
From the last equation it follows that 


Lf = AZ| —Ofry | Iy)). (75) 


Therefore 4,7 results from the extremal problem 


min ¥||Prx(Z"(Cq — [0 | Ara] Amy)lB,. (76) 


An, t=1 


This is a least squares minimization as known from linear regression theory. 
The solution A,z results from 


n 
Aan = gn — % = 2a DY Own 2 (Eu — Oe7nSin) (77) 
i=1 
with 
D2 — DS, (hry dq ra), (78) 
i=1 
Sa = F (om, Fa) = (—Ogrin tq) Zl —Opran | iy). (79) 


As one can show we obtain from (70) 
Sian — Sn = Zi"(L | Geran) Zizi — (Ea fri + Saran» Arm). (80) 


In these normal equation only the (d, x d,)-matrix », and the (q X q)-matrix 
S, are to be inverted. The computational effort corresponds to corresponding 
iterations with the solution of the WLSE equations as described in Section 
3.2.6. In the iteration procedure proposed here one has an estimate pw € Syx, 
of the experimental design at every stage. That need not be the case for the
corresponding procedures which solve the normal equations of the WLSE for
general implicit errors-in-variables models (EVM) (cf. Section 3.8.3).


3.3.3.2 Different covariances


Finally we consider the model which results from the possibly unequal numbers 
of replications at the single experimental points. Then, in the general model 
(60) one has to put 


Doin af ain Se +3 ™;. (81) 


And 
Mo = mn 


is the average replication number in the experimental points. Now, all 
formulas remain unchanged if in the corresponding terms 2, is replaced by 
Xivm). In the corresponding model with replications and possibly different 
covariances we have 


Nm 
Diem) = XilMi, m= >i m;. (82) 
iat 


To obtain the desired asymptotic properties one has to demand that (63) and 
(64) hold for any point of the experimental design (cf. Section 3.5.5.): 


lim Xi(myMi = dj = 0, (83) 
m1 = 0(m- 2), tm = o(m2), (84) 


In contrast to (64) the second part of (84) is not a conclusion of the first part 
but an independent assumption. 


3.3.3.3 Unknown different covariances 


We show how this procedure can be applied when replicated observations are
available, and how covariance estimates that are independent of the sample
means of the observations can then be obtained. In the model with $\boldsymbol{\Sigma} = \operatorname{Diag}(I_{m_i} \otimes \Sigma_i)$,


$\hat\Sigma_i = S_i/(m_i - 1)$, $S_i = \sum_{j=1}^{m_i} (z_{ij} - \bar z_{i\cdot})(z_{ij} - \bar z_{i\cdot})'$ (85)

are independent and unbiased estimators of the covariances $\Sigma_i$ belonging to the $z_{im} = \bar z_{i\cdot}$.
The corresponding statement holds for $\boldsymbol{\Sigma} = I_m \otimes \Sigma$ with


$\hat\Sigma = S/(m - n)$, $S = \sum_i S_i$. (86)


It is well known that these estimators converge to $\Sigma_i$ for $m_i \to \infty$, and to $\Sigma$
for $m - n_m \to \infty$, respectively.
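
The replicate-based covariance estimators can be written down directly. The following is a minimal sketch, assuming each design point carries at least two replicated observation vectors; the helper names `scatter_matrix` and `covariance_estimates` and the toy data are hypothetical.

```python
def scatter_matrix(points):
    """Return (mean, S) with S = sum_j (z_j - mean)(z_j - mean)'."""
    m = len(points)
    dim = len(points[0])
    mean = [sum(z[k] for z in points) / m for k in range(dim)]
    s = [[0.0] * dim for _ in range(dim)]
    for z in points:
        dev = [z[k] - mean[k] for k in range(dim)]
        for a in range(dim):
            for b in range(dim):
                s[a][b] += dev[a] * dev[b]
    return mean, s


def covariance_estimates(groups):
    """Per-point estimates S_i/(m_i - 1) and pooled estimate S/(m - n)."""
    dim = len(groups[0][0])
    pooled_s = [[0.0] * dim for _ in range(dim)]
    per_point = []
    for points in groups:
        m_i = len(points)
        if m_i < 2:
            raise ValueError("each design point needs at least two replicates")
        _, s_i = scatter_matrix(points)
        per_point.append([[v / (m_i - 1) for v in row] for row in s_i])
        for a in range(dim):
            for b in range(dim):
                pooled_s[a][b] += s_i[a][b]
    m = sum(len(points) for points in groups)  # total number of observations
    n = len(groups)                            # number of design points
    pooled = [[v / (m - n) for v in row] for row in pooled_s]
    return per_point, pooled


# Two design points with 2 and 3 replicates of a bivariate observation.
groups = [
    [(0.0, 0.0), (2.0, 2.0)],
    [(1.0, 0.0), (1.0, 2.0), (1.0, 4.0)],
]
per_point, pooled = covariance_estimates(groups)
print(per_point)
print(pooled)
```

The per-point estimates correspond to the case of different covariances, the pooled one to a common covariance for all design points.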
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As one can see, the estimators f, which are obtained by the minimization of 
L= Zn). by Hla | 
On Diag (1, © Sian), , and Q= 1, os. 


(87) 


converge toward the same values as the estimators computed for fixed aim): 


3.3.4 Estimation with instrumental variables 
in linear functional relations 


3.3.4.1 Introduction 


In this section the estimation using instrumental variables, which has been
treated for bivariate linear functional relations in Section 3.1.1, is
investigated for a general multivariate model. We start by fixing a general
model of a linear functional relationship with nonrandom unobservable
variables (LIFU⁺). For this we first prove some statements on parametrization,
in connection with models considered previously (in Section 3.1.3.3), and on the
form of the maximum likelihood estimator (under normal distribution).
Results and notation of this section are fundamental for the asymptotic treatment
of the model to be presented later on (in Section 3.4).

Then, in such a general model we consider estimation using instrumental
variables (IV). Here the connection between the LIFU⁺ model under
consideration and a corresponding model of a linear functional relationship with
random unobservable variables (LIFU⁻) turns out to be essential. In
such a model we first represent the IV-estimator as a maximum likelihood
estimator in a LIFU⁻ model (under normal distribution) enlarged by the inclusion
of some random IV. This result then serves as a heuristic basis for the
construction of the IV-estimator in the LIFU⁺ model (for nonrandom unobservable
and instrumental variables). A special IV-estimator in the model considered
will be defined as a maximum likelihood estimator (under normal distribution)
for a certain (possibly inadequate) model setup. This definition is justified by
the preceding consideration; the MLE results as a special case of the general
IV-estimator so determined. The asymptotic properties of the canonical
instrumental variable estimator (CIVE) thus obtained later form the subject of
Section 3.4.

The models investigated in the sequel are always linear functional
relationships with random or nonrandom unobservable variables according to
Definitions 3.1.6 and 3.1.5, respectively (including the implicit case). We
always assume that the error variables $\varsigma_i$, $i = 1, \dots, m$, are independent and
identically distributed, and that in the case of LIFU⁻ the same holds for the
quantities $\mu_i$, $i = 1, \dots, m$, which are independent of the $\varsigma_i$, $i = 1, \dots, m$. The
observation index of the random variables involved will sometimes be suppressed.
For the properties of the set $\mathfrak{L}_{k,l}$ of $l$-dimensional linear subspaces of $\mathbb{R}^k$
see [A 3.16]. For $\mathcal{L}_i \in \mathfrak{L}_{k,l_i}$, $i = 1, 2$, let $\mathcal{L}_1 + \mathcal{L}_2$ denote the linear hull of
$\mathcal{L}_1 \cup \mathcal{L}_2$; $\mathcal{L}_1 \setminus \mathcal{L}_2$ the orthogonal complement in $\mathcal{L}_1$ of $\mathcal{L}_1 \cap \mathcal{L}_2$; and $\mathcal{L}_1^{\perp}$ the
orthogonal complement in $\mathbb{R}^k$ of $\mathcal{L}_1$. For measures $P$ and linear mappings $A$ let $AP$
denote the image of $P$ under $A$.


3.3.4.2 A general model for linear functional relations with nonrandom 
unobservable variables 


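Let $p, q, r, m \in \mathbb{N}$, $q \le r \le p$, $q < p$, $m \ge p - q$, and a linear space $J$, $J \in \mathfrak{L}_{p,r}$,
be given. Furthermore, let $\mathcal{P}^{\varsigma}$ be a set of probability distributions $P$ over
$[\mathbb{R}^p, \mathfrak{B}^p]$ with


$\int x\,dP = 0_p$, $R\left(\int xx'\,dP\right) = J$ for $P \in \mathcal{P}^{\varsigma}$,
and let


$\mathfrak{M} = \{M \in M_{p\times m} \mid R(M) \in \mathfrak{L}_{p,p-q}\}$.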


We consider a distribution model (LIFU⁺) for an observable random $p \times m$
matrix $Z$, described by the following relations:


$Z = M + \varsigma$,
$M \in \mathfrak{M}$,
$\varsigma = (\varsigma_i)_{i=1,\dots,m} \sim P^{\otimes m}$, $P \in \mathcal{P}^{\varsigma}$.


Thus the observable matrix $Z$ decomposes into the matrix $\varsigma$ (with independently
and identically distributed columns) and the parameter matrix $M$, about
which we have the information $R(M) \in \mathfrak{L}_{p,p-q}$.

We consider the problem of estimating the derived parameter


$\mathcal{L} := R(M)$.


The included distribution model $\mathcal{P}^{\varsigma}$ is as yet unspecified (except by the above
assumptions). In the following we consider submodels of (LIFU⁺) and
specifications induced by further restrictions on $M$ and certain assumptions on $\mathcal{P}^{\varsigma}$.
These models will be indicated by listing the corresponding symbols, like
(LIFU⁺), (R), (N).

In Section 3.1.3.2 (for the case $r = p$, i.e. of a regular $D\varsigma$) the interpretation
of the model assumption $R(M) \in \mathfrak{L}_{p,p-q}$ as a linear functional relationship was
given, and correspondingly that of the model (LIFU⁺) as a model with errors in
variables. If $R(D\varsigma)$ is assumed known and different from $\mathbb{R}^p$, then (disregarding
a set of measure zero in the sample space) the function $P_{J^{\perp}}M$ of $M$ becomes
known also; in particular, certain components of the $\mu_i$, $i = 1, \dots, m$, may become
known (if $M = (\mu_i)_{i=1,\dots,m}$).
For the case of a general known $J = R(D\varsigma)$ considered here, we now introduce
an additional assumption (R), to be immediately justified by an example:
(R) $\mathcal{L} \in \{\tilde{\mathcal{L}} \in \mathfrak{L}_{p,p-q} \mid \tilde{\mathcal{L}} + J = \mathbb{R}^p\}$.


Example 3.3.1 (Inhomogeneous linear functional relationship) Let r,q,m € IN, 
r>q,m =r anda set P* of probability distributions P over [IR", 8*] with 


ear 0;, KR f aan’ dP) = Rt 
be given. Let 
M™* = {MEM , | SH ER! A(M — E14) € Q,,_,}. 


We consider a distribution model for an observable random (r X m) matrix Z, 
described by the following relations: 


Z, = M, + &* 


oF = CFint.ms S*OQP™ 


(for independent ¢7, 7 = 1,...,m). We consider the problem of estimating the 
parameter 


(£*, E) := (RM — El/,), B). 


If we put M, = (i);-1,..m> then the m;, 7 = 1,...,m lie on an unknown 
affine manifold and all components of 2; are only observable subject to error 
(since 7[D$*] = r). This model can be construed as a special case of (LIFU*), 
(R), if we pat p= 7 or 1,Z= (lj, | Ze], M = [Ui Mo), c = [01 xm $*], 


J = R([0) x tZ,]) 


and consider the resulting distribution model for Z. Namely, for L** € M,,q; 
R(L*1) = £* it holds that 


LPB TM == 0, a 
r[M] = dim &(M) = dim (—#} J,) R(M) + dim (1 Oy xr) R(M) 
= dim £* + dim AY) =r—qtl=p—gq. 
As r|[—E’ ve) Tie = q, we have 
£ = AM) = A —EF' ;1,] L*+)' = ((-#’ ABREU eee 


According to part (a) of [A 1.6], £ varies in Kt if (¥*, #) varies in ¥,,,-¢ XR’. 
With this the question of identifiability of the parameter (£*, H) is answered, 


_ too. Obviously £ is an identifiable parameter function (cf. Remark 3.3.5 be- 
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low), and using part (a) of [A 1.6] one easily shows that the equality w.r.t. 
f = ({[—EH I,| £**)+ is a maximal equivalence relation in the sense of Bunke 
and Bunke, 1986, definition 1.5.1). Consequently we can confine ourselves to 
estimating the parameter /. 


Obviously, for the case of regular $D\varsigma$ (i.e. $r = p$), the general condition (R)
represents no restriction. If (R) is violated, then $r[Z] < p$ holds almost surely
for any observation size $m$, and by transforming into $R(Z)$ a dimensionality
reduction of the data becomes possible.


Remark 3.3.1 It can be shown that (R) eliminates a set of parameters $\mathcal{L}$
which is closed in $\mathfrak{L}_{p,p-q}$ and of measure zero (in the sense of the invariant
measure on $\mathfrak{L}_{p,p-q}$; cf. [A 3.16]).


In this section we mainly consider the case of normally distributed errors:
(N) $\mathcal{P}^{\varsigma} = \{N_p(0_p, \Sigma) \mid \Sigma \in M_p^{\ge}, R(\Sigma) = J\}$.


We consider a further condition (Ex) which in the following leads to explicit
models. Let a linear space $J_0$, $J_0 \in \mathfrak{L}_{p,q}$, with the property $J_0 \subseteq J$ be given.
The condition is


(Ex) $\mathcal{L} + J_0 = \mathbb{R}^p$.


(Ex) thus implies (R); in the case $r = q$, (Ex) and (R) coincide.
We assume that


$J = R\begin{pmatrix} 0_{(p-r)\times r} \\ I_r \end{pmatrix}$
(if $r < p$) and
$J_0 = R\begin{pmatrix} 0_{(p-q)\times q} \\ I_q \end{pmatrix}$.


Remark 3.3.2 In the model (LIFU⁺), (R), (N) or (LIFU⁺), (Ex), (N), this
does not imply a restriction of generality. Indeed, there is always an orthogonal
transformation $O \in M_{p\times p}$ such that $OJ$, $OJ_0$ have the above form; with the
transformed observation $OZ$ and parameter $O\mathcal{L}$ one obtains a model in which
the above assumption is satisfied.


If (Ex) is assumed, then we speak of an explicit model. Indeed, according to
part (b) of [A 1.6], (Ex) with the above form of $J_0$ is equivalent to
$\mathcal{L}^{\perp} = R([-B \;\vdots\; I_q]')$ for a $B \in M_{q\times(p-q)}$; hence for $M = (\mu_i)_{i=1,\dots,m}$ it holds that


$[-B \;\vdots\; I_q]\,\mu_i = 0_q$, $i = 1, \dots, m$,
or
$\eta_i = B\xi_i$, $i = 1, \dots, m$, (88)
for $\mu_i = [\xi_i' \mid \eta_i']'$, $\eta_i \in \mathbb{R}^q$, $i = 1, \dots, m$. The linear functional relationship
$\mu_i \in \mathcal{L}$ between the components of $\mu_i$ can thus be written in the explicit form
(88).
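
The equivalence of the implicit and the explicit form of the relationship can be checked on a small numerical instance. The sketch below assumes $q = 1$ and $p - q = 2$ with arbitrarily chosen values; the helper names `mat_vec` and `implicit_constraint_matrix` are illustrative only.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def implicit_constraint_matrix(B):
    """Build the q x p matrix [-B | I_q] from a q x (p-q) matrix B."""
    q = len(B)
    return [
        [-b for b in row] + [1.0 if j == i else 0.0 for j in range(q)]
        for i, row in enumerate(B)
    ]


B = [[2.0, -1.0]]     # q = 1, p - q = 2 (illustrative values)
xi = [3.0, 4.0]       # unobservable "explanatory" part of mu
eta = mat_vec(B, xi)  # explicit form: eta = B xi
mu = xi + eta         # mu = [xi' | eta']'

# Implicit form: [-B | I_q] mu must vanish.
residual = mat_vec(implicit_constraint_matrix(B), mu)
print(eta, residual)
```

Any $\mu_i$ built from the explicit form satisfies the implicit constraint, and conversely the constraint determines $\eta_i$ from $\xi_i$.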

A further model assumption considered in the following is


($A_{\mathcal{W}}$) $R(M') \subseteq \mathcal{W} \in \mathfrak{L}_{m,n}$


for a given $n$-dimensional linear subspace $\mathcal{W}$ of $\mathbb{R}^m$, $p - q \le n \le m$. This
makes it possible to take into account certain restrictions on $M$, i.e. the
case where a model (LIFU⁺) is combined with a linear regression model. We
write (A) for this assumption if $\mathcal{W}$ is given and fixed.


Remark 3.3.3 For the model (LIFU⁺), (R), (A) to be nonempty, it is
obviously necessary that $n \ge p - q$. This condition is also sufficient: with
$L \in M_{p\times(p-q)}$, $R(L) = \mathcal{L}$, and $W \in M_{(p-q)\times m}$, $R(W') \subseteq \mathcal{W}$, it holds that


$LW \in \mathfrak{M}$, $R(W'L') \subseteq \mathcal{W}$.
That is why we always assume $n \ge p - q$ in what follows.


Let us now introduce some further notation. 


(i) For A € M,,.,, k,l € NN, let 
D4 rt, 1 -=[—4'' i], o(A) := R(Ly). 
Obviously L4 Ly = 0,,.,and e(A) = (A(L4))+ are valid. The mapping 9 : U Myx, 
k,leN 


> U &,4,) is injective. For further description see part (a) of [A 3.16]. 
k,leNW 


(ii) According to part (b) of [A 1.6], (Ex) is equivalent to £ € e(Myx(p-q)); 
then let_ 


Bo 


This parametrization of £ is used in the explicit case. Furthermore, let 
the parameters B;, i = 1, 2 be defined by 


B= (B, 1 Ba), B, € Das pen By € Wax ira) 


(in the case g <1 < p;forr =qorr = p let B, := Bor By i= B respec- 
tively). 
(iii) In the case r < p let 


J = E+ 


Or x (p-r) ? 


ees 
J var, 1 Dae 


and for r = p let 
Bh owe 
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furthermore, let 


topline Os 


ax(p—g) * 


Dg 2= Longo? 
Therefore for J, Jy from (R), (Ex) it holds that 

JI=KRJ), Jy = RLh). 
Then, for BE Max (p-q) 

L3=1,+ LB, DS a Dik ad Oo (89) 


Let us now introduce distributional specifications (V,), » = 1, 2, 3 under which 
the model is to be investigated. Let Py be a set of probability distributions P 
over [IR?, $?] with 


JEP Se ine tl ad Pe Oye a | ek dita 
(iv) Fora LE M=, R(L) = J, let 


Ve = {2 € ME | X, = 07D, a? > O} 
Vg := {Z, € ME | R(Z;) = J}. 
Let the specifications (V,), » = 1, 2, 3 be given by 
(V Pre | ca), Pea Pe), y= 1,253. 
Let us introduce some further notation: 
(v) Me JM | Tf eit LY Lp ot*=—JS; 
in the case r < p let 
MJ 
(vi) For § from the determination of (LIFU*), X from (iv), let 
pS, ap id Lied. 
of = tr [2;]/tr [2}. 


Remark 3.3.4 In the model (LIFU⁺), in the case $r < p$ it almost surely holds
that $J^{\perp\prime}Z = M_1$, because of


$E\,J^{\perp\prime}Z = M_1$, $D\,J^{\perp\prime}Z = (I_m \otimes J^{\perp\prime})(I_m \otimes \Sigma_\varsigma)(I_m \otimes J^{\perp}) = 0_{m(p-r)\times m(p-r)}$,


i.e. the submatrix $M_1$ of $M$ is observable.
Henceforth, if it is convenient, we will identify the model (LIFU⁺) with the
resulting distribution model for $Z_2$, given $M_1$.


Remark 3.3.5 In the model (LIFU⁺), $M = EZ$ and $\mathcal{L} = R(EZ)$ are
identifiable (for a distribution model $\{P_\vartheta, \vartheta \in \Theta\}$ a parameter function $\gamma: \Theta \to \Gamma$ is
called identifiable in the model if


$\forall\, \vartheta, \tilde\vartheta \in \Theta$: $P_\vartheta = P_{\tilde\vartheta} \Rightarrow \gamma(\vartheta) = \gamma(\tilde\vartheta)$; (90)


cf. Bunke and Bunke, 1986, Definition 1.5.1). Hence this also holds in (LIFU⁺),
(R), and in the other submodels and specifications considered here.


Remark 3.3.6 In the case $r = q$ the model (LIFU⁺), (R) is explicit, and in
(88) the $\xi_i$, $i = 1, \dots, m$, are observable ($M_1 = (\xi_i)_{i=1,\dots,m}$) according to
Remark 3.3.4. Thus a linear regression model results. In the formal derivations
we restrict ourselves to the case $q < r < p$ unless stated otherwise;
all results carry over, with obvious simplifications, to the cases $q = r$ and
$r = p$, and sometimes they are used in this form.


In the model (LIFU⁺), (R), (A), (N), ($V_\nu$) one can consider the problem of
maximum likelihood estimation of the parameters for $\nu = 1, 2, 3$; the
distribution of $Z_2$ has a Lebesgue density. In the following this is worked out in
more detail. It turns out that for sufficiently large $m$ (condition ($B_\nu$)) there
almost surely exists an MLE $\hat{\mathcal{L}}$ of $\mathcal{L}$ based on the continuous normal density,
and likewise an MLE of the parameters $\sigma^2$, $\Sigma_\varsigma$. For this the connection with the
models considered in Section 3.2 is specified in the following. The results of
Sections 3.2.3 to 3.2.5 will be used.

In accordance with Section 3.2.2, one then obtains the weighted least squares
estimator (WLSE) by dropping assumption (N). Further generalization to
the instrumental variable estimator is now achieved by cancelling requirement
(A) in the distribution model for $Z_2$, but obtaining the estimator as the WLSE from a
formal model setup (LIFU⁺), (R), (A), ($V_\nu$). This setup is thus possibly
inadequate. The space $\mathcal{W}$ used in restriction (A) represents the 'instrumental
variables'.

An intuitive basis of this estimation procedure can be found in the
connection of the model (LIFU⁺) with a model with random unobservable variables
(LIFU⁻). Roughly speaking, (A), i.e. $M' = W'K'$ ($W' \in M_{m\times n}$, $R(W') = \mathcal{W}$),
can be replaced by $M' = W'K'$ plus a remainder term, provided that this remainder is
asymptotically negligible in a certain sense. This corresponds to a condition of
nonvanishing correlation between the unobservable variables and the observable
instrumental variables represented by $W$ or $\mathcal{W}$. A further paragraph will be
devoted to justifying this procedure in detail by the connection to model (LIFU⁻).
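
The role of a nonvanishing correlation between instrument and unobservable variable can be seen in the simplest bivariate situation. The sketch below is not the canonical estimator developed in this section; it is the classical ratio estimator $S_{wy}/S_{wx}$ for a scalar relation with one random instrument $w$, under assumed normal distributions and hypothetical parameter values. The name `simulate_iv` is illustrative.

```python
import random


def simulate_iv(n=20000, b=2.0, seed=1):
    """Return (OLS slope, IV slope) for a scalar relation eta = b * xi.

    Assumed setup: xi = w + u with instrument w, x = xi + delta,
    y = b * xi + eps; w, u, delta, eps independent normal variables.
    """
    rng = random.Random(seed)
    s_wx = s_wy = s_xx = s_xy = 0.0
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)            # observable instrument
        xi = w + rng.gauss(0.0, 1.0)       # unobservable, correlated with w
        x = xi + rng.gauss(0.0, 1.0)       # observed with error delta
        y = b * xi + rng.gauss(0.0, 0.5)   # observed with error eps
        s_wx += w * x
        s_wy += w * y
        s_xx += x * x
        s_xy += x * y
    ols = s_xy / s_xx   # inconsistent: attenuated by the error in x
    iv = s_wy / s_wx    # uses only moments with the instrument
    return ols, iv


ols_slope, iv_slope = simulate_iv()
print("OLS:", ols_slope, "IV:", iv_slope)
```

Since $w$ is uncorrelated with both errors but correlated with $\xi$, the ratio $S_{wy}/S_{wx}$ approaches $b$, whereas the OLS slope stays near $b\,D\xi/(D\xi + D\delta) = 4/3$ in this setup.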

The estimator $\hat{\mathcal{L}}$ thus obtained in a model (LIFU⁺), (R), ($V_\nu$) will be
referred to as the canonical instrumental variable estimator (CIVE). From the
alternatives to the MLE (cf. Section 3.3.1) existing in the model (LIFU⁺),
(R), (A), (N), ($V_\nu$), alternative instrumental variable estimators result in an
analogous way. In this framework an asymptotic comparison is then carried
out, in the explicit case, in Section 3.4. The comparison of the MLE and its
alternatives results as a special case.


Before proceeding to a detailed development let us give some more basic 
notation. 


(vii) For A € Meter) let 
area dray 


F, isalwaysa regular lower triangular matrix; for A; € M,.(p_y,% = 1,2, 
it holds that 


sr lig x de Bee Fy a Y Sarg 
OY AS dir ceed ens Ue lead Be Uae ae 


(viii) For linear spaces Y,€ 2s, %= 1, 2 and matrices X €Mixm, Sir t, 
my € IN, + = 1,2, let 


Sgii= XX’ 
Qr.y, = XPyX', Sx.y, = XP y.X' 
Oxyry, = XPyyyX: 
Here, on the left-hand side of the last three terms, J/; is allowed to be 
replaced by matrices Y; if R(Y4) = Y;. 
(ix) For 2 from (iv) let 


SO 3S 25) ya 1, 2"SO= (m —n) Sz~ 


Oo” := mQz.% —nm 8”, »=1,3 
OP := mQz.% — nm 2, 
E = Z,M; 


A 


BO (Pet Sry Ls (eel a rape 
(x) Let 
COST SOs; 
O35” := mQz,.0-u, — nm 3S*, y= 1,2,3. 
(xi) For A ¢ Mt! (for the definition see [A 1.2]) let 1,.(A) denote the (uni- 


quely determined) eigenspace corresponding to the J greatest eigenvalues 
of A 
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(xii) For a random variable £ with values in & let 
TOL) = LAE SDY LY (Lt € Mpg, RLY) = £4, 
Vi== el eo. 


According to [A 3.14], R(S®) = J holds almost surely, so that accord- 
ing to [A 1.3], almost surely 


L!S™L! ¢ Ke 
(xiii) Let assumptions (B,), v = 1, 2, 3 be defined by 
(B,) n= |_ 


(Bs) m=n+r. 


a 


In the following the models under ($V_\nu$), $\nu = 1, 2, 3$, will mostly be treated jointly,
often suppressing the index $\nu$ in the notation of (ix), (x), and others.


3.3.4.3 Maximum likelihood estimation in linear functional relations
with nonrandom unobservable variables


Let us consider the models (LIFU⁺), (R), (A), (N), ($V_\nu$), $\nu = 1, 2, 3$, under the
assumption $q < r < p$. At first we indicate a representation of the model which
establishes the connection to model (3.1.38). For this recall Remark 3.3.4.
Let $M_1 \in M_{(p-r)\times m}$ be given. Because of (A) we have $R(M_1') \subseteq \mathcal{W}$.


(xiv) For £ € % let 
My = (My € My xm | RLM + Me)) = £, RM, | MY) SY}. 


For L¥t ¢€ M2 of CMn xg UC Utnxmii =v (p —*); Vien 
let 
Mint z.uv i {iM < lave | Im, c Ale uM, € Mx (pany) 
Ms, — M,.U+ i.V, L*' i, = 05 xn, L*’' MM, = £. 
Lemma 3.3.1. The model (LIFU*), (R), (A) as a distribution model for Z, 
can be described by the following relations: 
Z, ca M, + (e: 
M,€ U U Mis z.00n (91) 


L*+eMs, LEM xa 


eee) ee COS SAP ER Pe 
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(for independent Oy, i = 1,..., m), af 

OEM ms M=n—(p—7), AU')=WN AM). 
Proof. Note that the model (LIFU*), (R), (A) can be described by 


Z, = M,+* 
ais : (92) 
M,€U My 

LER 


with $* as above. Let £ = Fz(J+ + J£*) be a representation of £ according 
to part (a) of [A 1.6] for 2 € M,y(p-», £* € &,-4. Now, in view of [A 1.7] 
we find that MN) = V@azyy if AL“) = f+, Lf = —2I™, V = Mi;,, 
R(U') = W \ AK(M}) is satisfied. Hence (92) can be written as 


2 
M, € WU) Wy) Mies, er L*t,U,My> 


De eM EeM,x(p-r) 
and with {—E’L**| Be Mx (p—n} = Mip-r)xq the assertion follows. IM 


This representation of the model now leads to a model (3.1.38). Thus under 
specifications (N), (V,) we can exploit the results of Sections 3.2.3—3.2.5 on 
maximum likelihood estimators (MLE) in models of this type. From the MLE 
of the parameter M, one then derives the MLE of the parameters L*!, £ and 
from this, using the relation of My and Mi«1,z.y,4, according to [A 1.7] one 
obtains the MLE of the parameter £ (cf. Section 3.2.2). 


In the following lemma we give a relationship between the statistics introduced
under (viii), (ix) and (x). There we refer to the model (LIFU⁺), (R); in addition
let $\mathcal{W} \in \mathfrak{L}_{m,n}$ be given with $n \ge p - q$, $R(M_1') \subseteq \mathcal{W}$.


Lemma 3.3.2 In the model (LIF U~*), (R) it holds that 
Qz.y = Fe(I*Sy, J" + JQz,.w.n,J') F's 
Oo = Fa(Jim Sy J" + IQ") Fs 

if RM{) SW € Qnny rn =p — q is fulfilled. 


Proof. Apply [A 1.5] for A =Qz.y, Au =Sy,, Ao = 2M}, Av = Qz,.y- 
Here A(A) + J = RP? is satisfied because of 7[A,,] = 7[J!’M]) =p —r 
({A 1.3])., The decomposition for Qp results from 


RQ — mQz.w) S I 
and F;J=—J. 


Theorem 3.3.1 In a model (LIFU*), (R), (A), (N), (V,) let the assumption 
(B,) be satisfied. Then it holds that 
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(a) An MLE # of £ based on the continuous density for Z, exists almost surely 
and is a.s. uniquely determined. It is given by 


2 = Fa(d* + JL*), B= 7Z,M+ (93) 
and the almost surely valid relation ; 

£* = S¥%y,,_ (S*-HQES* ah), (94) 
Here S* € M7 and S*-12 QFS* 1? € M"—1 almost surely hold. 
(b) 2 = Qok Boe (95) 


is valid for 
Ey = (P5. + [S*}"”) F5'- 


(c) The relations (93), (94) and (95) remain valid if OF (correspondingly Op 
according to Lemma 3.3.2) is replaced there by QF + AS* where 2 is an 
arbitrary real random variable. 

(d) Under (V,), v = 1, 2 there almost surely exist MLE 6? and & of o; and &,, 
respectively, based on the continuous density for Z,; these are a.s. uniquely 
determined. They are given by 


6? = (mr) tr [S*Sz.y + WL) Qz.¥) (96) 
S = mm — n)S + SIL) (OM (L) S + nm-18) (97) 
(IT (2) from (xii)). 


Proof. (a) We consider the representation of the model (LIFU*), (R), (A) 
according to Lemma 3.3.1 in connection with the distributional assumption 
implied by (N), (V,). This just represents a model of the form (3.1.38) for which 
the problem of maximum likelihood estimation was treated in Sections 3.2.3 
to 3.2.5. There some different notation was used. There is the following re- 
lationship between the notation in the model according to Lemma 3.3.1 and 
(v), (vi), and (x) on the one hand, and that used in (3.1.38) and in Sections 
3.2.3 —3.2.5 on the other hand: 


(3.1.38), ‘ = 
Section 3.2 p ng ZM,M, 2X T+ mm=n) 18 mQ—n(m—n)I8 


Section 3.3.4 r p—r Z, M, M, SF L*+ S* Qo”? 


The other notations used in the following coincide. From Section 3.2.4, 3.2.5, 
one can see that the conditions (B,) ensure boundedness of the likelihood 
function and existence of the MLE of the parameters. Let the resulting MLE 


of the parameters L*+, M, be denoted by L*!, Mg. 


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Under (V3), from (3.2.67), (3.2.68), and (3.2.70), one obtains 
R(L*+)+ ae S*2y pat mg yS*'e) ; 


From Sections 3.2.3, 3.2.4, one can see that this formula also holds under (V,), 
y = 1, 2 (with the determination of S* according to (x)). Here we have almost 
surely 


S*-U2m-1Q7,.yS*-? e mela) 


since, as shown in Sections 3.2.3 to 3.2.5, a.s. $r[m^{-1}Q_{Z_2\cdot\mathcal{W}}] \ge r - q$ and $S^* \in M_r^{>}$,
and since the $r - q + 1$ largest solutions of (3.2.68) are a.s. different.
Now 


igs mney ys?) Pais Nrr—g(S* 2 2Qs S*-U2) 


is valid, and for M, one obtains from (3.2.76) 


~*~ 


M, = Z,V*. (98) 


With Lemma 3.3.1 and [A 1.7] the assertion follows. 
(b) Taking into account 7[M,] = 7[J/’M] = p — r and [A 1.3], from Lemma 
3.3.2 one concludes that 


Ob Lol = FaItm Sy J + JQIS* J) Fy ZL 
= FalJ¢m-8y,J" + IQ3S*J') (I+ + TL*) 
= Pa(J+ + JP) = 2. 
(c) The assertion w.r.t. (94) and thus to (93) follows from 
tesa S*UAQ3S* A) = th gf S*AQES* + AT,) 
= tra St Q5 + AS*) S*1"). 


As above, from this one obtains the assertion w.r.t. (95). 


(d) By $\hat M_2$, $\hat\sigma^2$, $\hat\Sigma^*$ we denote the MLE of the parameters $M_2$, $\sigma^2$, $\Sigma^*$ in the model
according to Lemma 3.3.1, with the distributional assumption implied by (N),
$(V_\nu)$, $\nu = 2, 3$. We consider the specification (V3) and initially use the
notation of Section 3.2.5. (3.2.68) and (3.2.71) imply


m&te£(I, + Dy) Dal’ = HLL, + Dy) 
x LYQLAT, + Dy) Lee 
= Pgnpi8-?Q8 PP gags. 
From this and from (3.2.73) one obtains 


Ape ae mASU2P g144 1 S-U2QS-12P eins SU2, 
consequently, in the notation of (x),

>* = (m —— n) m18* 4. SUP gspop er S*U2OF S*-U2P cing gs STH? 

+ NM S*FU2P corp er S¥U2 (99) 

According to part (a) of [A 1.6] and (a) above, for f+ := Fy, JL* we have the 
relation R(L+) = #+. By virtue of Lemma 3.3.2 one obtains 

LY’ Qofht = LF g(J4m8y, J" + JOR’) Fi? = L*Q3h* , (100) 

SLi = JS*J'L+ = JS*J'F IL* = JS*t*, (101) 

iY’Si! = £*'s*7*, (102) 


As with (a) one obtains the MLE & of 2; in the model (LIFU*), (R), (A), (N), 
(V3). Now 2. = JX7J’ holds due to (vi); from (99)—(102) one obtains the 
assertion for © = JD*J’. 

Now we consider the specification (V2). From Section 3.2.4 one obtains


6? = (mr)? tr [S*-(Z, — M,U — MV) (Z, — M,U — M,V)’, 
which implies, under consideration of (98), 

6? = (mr) tr [S*-(Sz,. + (Z. — MU) PyAZ, — M,UY)|. (103) 
Under (V3) (in the notation of Section 3.2.5) from (3.2.73) one obtains 

tuet sir. 
therefore (3.2.74), (3.2.75) imply 

B=, —2 (L214) it) ZU. 


(Here $U^+$ denotes the Moore-Penrose inverse; see Bunke and Bunke, 1986,
[A 1.17].)

It can be shown (cf. the proof of Theorem 3.2.9) that this formula obtained
under (V3) is also valid under (V1) ($\hat\Sigma$ nonrandom). Section 3.2.4 implies its
validity also for (V2) (for $\hat\Sigma = \hat\sigma^2 S^*$). One thus obtains


M, cs (i, as S* RL DAL Se het) 2 L*+) Z,U+ 
and 

(Z> = M,U) Py: = S*Y2P ceusperS* ?Qz, yS*?P gerngesS?. 
Hence (103) gives 

6? = (mr)-+ tr [S*-187 + P cwthpni Sta Oz ye oe) 


= (mr)-* tr [S*187. wy ie (L*1'S*f*+)-1 LO; gud}: 


Analogously to (100) one obtains from Lemma 3.3.2 for L+ = Fy JL* that
b'Qz.yh4 = L*'Qz,0.u Lh", 
which establishes together with (102) the assertion for 67. 


Let the MLE of £ obtained under the specification (V,) be denoted by 2. 


Remark 3.3.7 Under (V2) Qo is a function of the unknown parameter o;; 
then statement (c) for 2 = 1 — of provides the form of the MLE as a function 
of the observations. Let Z, ~ P%* € (LIFU*), (BR), (A), (N), (Ve), (Bz); o% be 
the corresponding parameter value; ? be defined as above; and ?™ be the 
MLE of ¥ under (Vj) for V; = {025}. Then (c) implies ?© = #®), 


Remark 3.3.8 According to Remark 3.3.6 one easily obtains the form of the
MLE in the case r = p: we almost surely have


$$\hat{\mathcal{L}} = S^{1/2}\,\mathcal{N}_{p,p-q}(S^{-1/2} Q_0 S^{-1/2}).$$


Hence, in the case r = p, $S^{-1/2}\hat{\mathcal{L}}$ is the eigenspace belonging to the p − q largest
eigenvalues of $S^{-1/2} Q_0 S^{-1/2}$. In the case q < r < p, Theorem 3.3.1(b) provides a
corresponding statement: $E_0\hat{\mathcal{L}}$ is an eigenspace of $E_0 Q_0 E_0'$, which however does
not necessarily correspond to the p − q largest eigenvalues. But this is the case
for certain realizations of Z.
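The eigenspace characterization for the case r = p can be illustrated numerically. The following is a minimal sketch for p = 2, q = 1, assuming the estimator has the form $S^{1/2}\mathcal{N}_{p,p-q}(S^{-1/2}Q_0S^{-1/2})$, where $\mathcal{N}_{p,p-q}(\cdot)$ is read as the span of the eigenvectors belonging to the p − q largest eigenvalues. The diagonal weight matrix S, the noise-free matrix Q0 and the helper names are illustrative assumptions, not part of the text.

```python
import math


def top_eigvec_sym2(a, b, c):
    """Unit eigenvector for the largest eigenvalue of [[a, b], [b, c]]."""
    # The major axis of a symmetric 2x2 matrix has angle 0.5 * atan2(2b, a - c).
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return (math.cos(theta), math.sin(theta))


def fitted_slope(q0, s):
    """Slope of the line S^{1/2} N(S^{-1/2} Q0 S^{-1/2}) for S = diag(s).

    q0 is a symmetric 2x2 matrix (nested lists), s holds the two positive
    diagonal entries of the assumed weight matrix S.
    """
    d = (math.sqrt(s[0]), math.sqrt(s[1]))
    # Entries of S^{-1/2} Q0 S^{-1/2} for diagonal S.
    a = q0[0][0] / (d[0] * d[0])
    b = q0[0][1] / (d[0] * d[1])
    c = q0[1][1] / (d[1] * d[1])
    v = top_eigvec_sym2(a, b, c)
    # Map the eigenvector back with S^{1/2}; the line it spans is the estimate.
    direction = (d[0] * v[0], d[1] * v[1])
    return direction[1] / direction[0]  # undefined for a vertical line


# Noise-free illustration: Q0 = h * (1, beta)' (1, beta) has range span{(1, beta)'}.
h, beta = 2.0, 0.5
Q0 = [[h, h * beta], [h * beta, h * beta ** 2]]
slope = fitted_slope(Q0, s=(1.0, 4.0))
```

In this noise-free setting the recovered line does not depend on the choice of S; with noisy data the weighting by S generally matters.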


Corollary 3.3.1 Let (Ai), a= 1,.:.,p, 4S... SA, be the ordered p-tuple of 
the eigenvalues of EyQ ok}. If 


Aq —< Amin 1S y,], 
then 
Se ocr BS Np, p-q(BoQoLs)- 


Proof. As almost surely S* € IM? it almost surely holds that r[E)] = p: Theo- 
rem 3.3.1 (a), part (a) of [A 1.6], and Lemma 3.3.2 imply 


E,? bye + Ip pg S*-Y2QES*-12) 


= RI*m Sy I") + 1p. Ps EoQoEPs)- 
Now 
E,Q.k) = J¢m8y J" + Pk QokiPs 


establishes the assertion. 


Remark 3.3.9 In the model (LIFU*), (R), (A), (V,), i.e. in the model without 
normality assumption, the estimator £” remains almost surely defined by 
virtue of assumption P”: < y, and the relation u, << P” valid under (N). In 
accordance with the explanations of Section 3.2.2 we call ?” a weighted least 
squares estimator (WLSE) in this case. 
Remark 3.3.10 The statements of Theorem 3.3.1 as well as the following ones
carry over to the explicit case, i.e. the case where (R) is replaced by (Ex). 
Indeed this means the introduction of an additional restriction in the model: 
L£ € Q (Myx (p-q- But according to Remark 3.2.4 it already almost surely holds 
that L* € e(Mgx(r-q)) (under the assumptions of Theorem 3.3.1); therefore, 
according to part (c) of [A 1.6] one almost surely obtains ? in ‘explicit form’: 
we almost surely have ? € e(M,x:p-q)). Then, for the estimators B;, i = 1, 2, 
of the parameters B;, 7 = 1, 2 (cf. (ii)) it almost surely holds that 


B, pals e-(S*2y, .(S*-2Q38*-12)) 
and 


A 


B, —— Is E. 
For the estimator B of B one obtains almost surely 


B= o\(Fa(J* + JZ*)). 


3.3.4.4 Estimation using instrumental variables in linear 
functional relations 


To provide a heuristic basis for instrumental variable estimation in the model
(LIFU*), its connection to a model with random unobservable variables, i.e.
LIFU- (cf. Section 3.1.5), is of importance.


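Let $p, q, r, m \in \mathbb{N}$, $q \le r \le p$, $q < p$, $m \ge p - q$, and a linear space
$\mathcal{J}$, $\mathcal{J} \in \mathfrak{L}_{p,r}$, be given. Furthermore, let $\mathcal{P}^\zeta$ be a set of probability distributions
$P$ over $[\mathbb{R}^p, \mathfrak{B}^p]$ with

$$\int x\,dP = 0_p, \qquad \mathcal{R}\Big(\int xx'\,dP\Big) = \mathcal{J}$$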
and let 
ME = {E € ME | AZ) € Lpp-g}- 


We consider a distribution model (LIFU-) for a sequence $\{z_i\}_{i=1,\dots,m}$ of
independent random p-vectors $z_i$, $i = 1, \dots, m$, which is described by the
following relations:


z=utSs 
Ue PF Sao {N (9p, 2',) | 7a € M=}, 
SOF 


(for independent $u$, $\zeta$). We consider the problem of estimating the derived
parameter $\mathcal{L} := \mathcal{R}(\Sigma_u)$.

The assumption $\mathcal{R}(\Sigma_u) \in \mathfrak{L}_{p,p-q}$ can be construed as an assumption of a linear
functional relationship between the components of the (partially) unobservable 
random vector U. 

Obviously, the model (LIFU-) corresponds to the model (LIFU*) if $(z_i)_{i=1,\dots,m}$,
$(u_i)_{i=1,\dots,m}$ is put into correspondence to Z, M (cf. comment to Definition


satisfied. However, a normal distribution family is distinguished here in a 
natural way. 

As in the model (LIFU*) we will make additional assumptions (R), (N),
(V3) and also consider explicit models (LIFU-) (those which satisfy (Ex)).
As above, the special form of $\mathcal{J}$ and $\mathcal{J}_0$ is assumed. Remark 3.3.2 remains valid
in models (LIFU-), (R), (N) and (LIFU*), (Ex), (N), respectively, in particular 
because of the normality of P*. 

Analogously to the case of (LIFU*), the assumption $\mathcal{R}(D\zeta) = \mathcal{J}$ implies that
certain components of the random vector $u$ are observable in the case $r < p$.
In the case $r = q$ a reduction to a regression model results (with stochastic
regressors; for this compare Example 3.4.5 below).

Now, the randomness of $u_i$, $i = 1, \dots, m$, assumed here in distinction to
(LIFU*) means that the parameter $\mathcal{L}$ need not be identifiable in the model
(in the sense of (90)).


Theorem 3.3.2 There exist $p, q, r \in \mathbb{N}$, $q \le r \le p$, $q < p$, so that for any
$m \ge p - q$ the parameter $\mathcal{L}$ is nonidentifiable in the model (LIFU-), (R), (N),
(V3) (cf. Remark 3.3.5).


Proof. From the proof of Theorem 3.1.1 (Section 3.1.4) one can see that its
statement remains valid if it relates to a model (3.1.5), (3.1.6) with $\alpha = 0$,
$E\xi = 0$ (homogeneous case). But this model is a special case of (LIFU-), (Ex),
namely for $r = p = 2$, $q = 1$. Then the modified assertion of Theorem 3.1.1
states that the parameter $B = \varrho^{-1}(\mathcal{L})$ (according to (ii)) and thus $\mathcal{L}$ are
nonidentifiable in the model. Hence, this holds in the model with (R) instead of
(Ex).
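The nonidentifiability in the case r = p = 2, q = 1 can be made concrete with two parameter constellations that generate the same normal distribution for z. The following is a minimal sketch, assuming the homogeneous model z = u + ζ with u = (ξ, βξ)', ξ normal with mean zero and variance tau2, and an arbitrary positive definite error covariance; the numerical values and helper names are illustrative choices.

```python
def cov_z(beta, tau2, sigma):
    """Covariance of z = u + zeta with u = (xi, beta * xi)', Var(xi) = tau2.

    sigma is the 2x2 error covariance (nested lists); u and zeta are taken
    to be independent with mean zero.
    """
    return [
        [tau2 + sigma[0][0], tau2 * beta + sigma[0][1]],
        [tau2 * beta + sigma[1][0], tau2 * beta ** 2 + sigma[1][1]],
    ]


def is_positive_definite(a):
    """Sylvester criterion for a symmetric 2x2 matrix."""
    return a[0][0] > 0 and a[0][0] * a[1][1] - a[0][1] * a[1][0] > 0


# First constellation: slope 1 with unit error covariance.
beta1, tau2_1 = 1.0, 1.0
sigma1 = [[1.0, 0.0], [0.0, 1.0]]
target = cov_z(beta1, tau2_1, sigma1)

# Second constellation: a different slope; the error covariance absorbs the rest.
beta2, tau2_2 = 2.0, 0.25
sigma2 = [
    [target[0][0] - tau2_2, target[0][1] - tau2_2 * beta2],
    [target[1][0] - tau2_2 * beta2, target[1][1] - tau2_2 * beta2 ** 2],
]

same_distribution = cov_z(beta2, tau2_2, sigma2) == target
admissible = is_positive_definite(sigma2)
```

Both constellations give a centered normal distribution with the same covariance matrix, so observations of z alone cannot distinguish the two slopes.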


Therefore identifiability of $\mathcal{L}$ in a model (LIFU-), (R), (N), (V3) has in general
to be guaranteed by additional information or modified model assumptions. The
cases mainly considered in the literature are the following:


(A) In comparison to (N), (V3), additional information is assumed with respect
to $D\zeta$, i.e. (N), (V1) or (V2) or another restricted class of normal
distributions. For this we refer to Section 3.2.1.

(B) Under (N), (V3), $\mathcal{P}^u$ is fixed as a class of nonnormal distributions (cf. Theorem
3.1.1, Section 3.1.4). Then the minimum distance method yields consistent
estimators (Wolfowitz, 1952, 1953, 1957). Other procedures have been
given by Neyman (1951), Rubin (1956), Spiegelmann (1979).

(C) Method of instrumental variables. Here an enlarged model is considered 
instead of the additional information mentioned in (A). 
As in Section 3.3.1 consider independent observations $u_i$, $i = 1, \dots, m$, of an
instrumental variable (IV) $u$, here assumed to be random. For $n_1 \in \mathbb{N}$ let $u$ take
values in $\mathbb{R}^{n_1}$. For the present, uncorrelatedness of $u$ and $\zeta$ is required. In the
following we proceed according to Remark 3.3.6. It can be shown that the
statement of Theorem 3.3.2 remains valid under $q < r < p$, too.

We introduce the following notation: 


(xv) Wo := [Uta], %:=[ule], ns=m+p—r 
OW i= [ef Jt’) at [ur J1’2] 
(Because of (DG) = J) it almost surely holds that J+’ = J*’z.) 


(xvi) For random vectors v;, 7 = 1, 2, let 


Cf , f 
E,,1= Dvy — Enp, = Eve}, 
Paton K 
Ba enki ig ys eed Sh hae 


We consider a distribution model for the sequence $\{z_{0i}\}_{i=1,\dots,m}$ of independent
random $(n_1 + p)$-vectors $z_{0i}$, $i = 1, \dots, m$, which is described by the following
relations:


% = Wo + [0n, 15] 

Uo ©) Pre := 1 n.+0( Ono» ((24)) 24-73) | ((24))§243 © ye ? 
((2,))fz33 « Te}, 

SOF 


(for independent $u_0$, $\zeta$). We consider the problem of estimating the derived
parameter


£ := A(((24))}=33) 


(P". is nonempty: ife.g. £ = Ki ((Xi;))j=35) € Kt holds, then 24, € M>_, follows 
according to [A 1.5], and a completion to ((2;;))=15 € My becomes possible.) 
One recognizes that with the denotations of (xv) {#;};-1,... satisfies a model 
(LIFU-) where condition (R) is fulfilled by virtue of 2. € Nt>_, and [A 1.5]. 
Accordingly we denote the above model by (LIFU-), (IV), (R); it represents
the enlargement of the model (LIFU-), (R) which arises by considering the joint
distribution of $z$ and the instrumental variable $w$.

In the model, due to (xv), (xvi) it holds that ¥,, = ((2i;))j=73, 2, = ((24,) ESS; 
then parameters 2,,,., X,,., and further ones are defined as well. Here 2, € My 
is assumed for simplicity. 
Now we investigate identifiability of $\mathcal{L}$ in the model (LIFU-), (IV), (R),
(N), (V3). For this we note that the distribution of $z_0$ is determined by $\Sigma_{z_0}$,
and that, because of $\Sigma_{zw} = \Sigma_{uw}$, the relation $\mathcal{R}(\Sigma_{zw}) \subseteq \mathcal{L}$ holds. As $\Sigma_{zw}$ is a
function of $\Sigma_{z_0}$, it follows that $\mathcal{L}$ is identifiable in the model (LIFU-), (IV), (R),
(N), (V3) with the additional assumption


(Id) $r[\Sigma_{uw}] = p - q$.
(The representation (104) below and [A 3.17] imply that the model (LIFU-),
(IV), (R), (N), (V3), (Id) is nonempty.)

In the following we will also show the necessity of this condition in the sense
that in any submodel of (LIFU-), (IV), (R), (N), (V3), $\mathcal{L}$ is identifiable only
if (Id) is satisfied there.
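The role of condition (Id) can be illustrated in the simplest case p = 2, q = 1 with a scalar instrument. The following is a minimal simulation sketch, assuming a centered model x = ξ + ζ1, y = βξ + ζ2 in which the instrument w is correlated with ξ (coefficient gamma, an illustrative name) and independent of the errors; it is not the maximum likelihood estimator of the text, only the moment relation Cov(y, w) = β Cov(x, w) that makes the slope identifiable.

```python
import random

rng = random.Random(2)
beta, gamma, m = 2.0, 0.8, 20000

ws, xs, ys = [], [], []
for _ in range(m):
    w = rng.gauss(0.0, 1.0)                    # instrument
    xi = gamma * w + 0.6 * rng.gauss(0.0, 1.0)  # unobservable, Var(xi) = 1
    xs.append(xi + rng.gauss(0.0, 1.0))         # x = xi + zeta_1
    ys.append(beta * xi + rng.gauss(0.0, 1.0))  # y = beta * xi + zeta_2
    ws.append(w)


def mean_product(a, b):
    """Uncentered empirical cross moment (all variables have mean zero)."""
    return sum(u * v for u, v in zip(a, b)) / len(a)


s_xw = mean_product(xs, ws)   # estimates gamma; nonzero under the rank condition
s_yw = mean_product(ys, ws)   # estimates beta * gamma
beta_iv = s_yw / s_xw         # moment estimator based on the instrument

# For comparison: least squares of y on x ignores the error in x.
beta_ols = mean_product(xs, ys) / mean_product(xs, xs)
```

If gamma were zero, the cross moments with w would carry no information on the slope and the ratio would be unstable; this corresponds to a violation of the rank condition.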


(xvii) For £ € ¥ let 


peeee 


ME = {2 € ME, |S = ((2y)) rs, (Za) HTS € MZ, 


ALN? DS (Dyas =a St ot, AU) mer, Itt) ed 


My = {MEN <1 MM) SL, IMM = (On un, ipa 


Obviously the model (LIFU-), (IV), (R), (N), (V3) for the sequence $\{z_{0i}\}_{i=1,\dots,m}$
of independent observations can be written as


%o © (Na sp(Onr5, 27.) | 2z, © U WG}. 
LER 


With that we can now make precise the statement on the meaning of condition
(Id) for the identifiability of $\mathcal{L}$ in the model (LIFU-), (IV), (R), (N), (V3).
Obviously (Id) is equivalent to $r[\Sigma_{zw}] = p - q$.


Theorem 3.3.3 Let & be a nonempty subset of U It. In a model 
FER 
%y © {Pry | LEM, 0B, F ER 
for 


Psy = Neer Omans 2) 
the parameter £ is identifiable if and only if it holds that: 
Ry ~ Wap Oeryr= speec ts 20 = gt OL.) ee, 


(i — (Osa | I,) 1) 
imply 
1220) = p — gq. 


Proof. It was already shown that the condition is sufficient for identifiability. 
To prove necessity we assume the existence of 2, £ with xz, € Me n , 
£ ER, such that 7[2,,] <p —q holds (with the above meaning of Z,,).
We have to show the existence of £’€ R, £’ + F so that Qn M2 on M3. is 
nonempty. Let a mapping 


fs OM, > MZ X Moxa X MZ 
LER 
be defined by: if 2 ~ Nasp(On+p, Xz), 2, € Me holds, then let f(2,,) 
= (Ly, LX’ Xz.) (with the above meaning of w, 2). According to [A 3.17] it 
suffices to prove f(M%}) n f(Mz-) n f(Q) + BW. Now for f(2,,) = (Ly, M, Lew) 
the relation r[M] < p —q holds due to the assumption, and according to 
[A 3.17] we have M € M4. Since 9 is open and because of part (g) of [A 3.16] 
an £’ eR, £’ + £ with Me M4 always exists. Now [A 3.17] implies (2,,, 
M, Zw) € f(M-) and thus the assertion. 


Remark 3.3.11 Condition (Id) is thus the minimal (in the sense of Theorem
3.3.3) identifiability condition for $\mathcal{L}$ in the model (LIFU-), (IV), (R), (N), (V3).
It represents a generalization of a condition of nonvanishing correlations of
unobservable and instrumental variables (cf. (3.3.10)). One can see that $w$,
which includes $J^{\perp\prime}u$, functions as an instrumental variable. Condition
(Id) is assumed in the sequel.


Now we proceed to obtain the maximum likelihood estimator of $\mathcal{L}$ in the
model (LIFU-), (IV), (R), (N), (V3), (Id). First we give a further equivalent
representation of the model where we denote the statement of a family of
conditional distributions for $z$ under the condition $w = w$ by '$z \mid w = w \sim$'.


Lemma 3.3.3. The model (LIFU-), (IV), (R), (N), (V3), (Id) for the sequence
$\{z_{0i}\}_{i=1,\dots,m}$ of independent random $(n_1 + p)$-vectors $z_{0i}$, $i = 1, \dots, m$, can be
described in the following way:


% = [wi J’z] 


w © {Nn(On, Zo) | Xn € Mr} 
eh (104) 
z|w = w®© {N,(Mw, 2.) | Mé U My o MPZ4,, 
LER 


ere Valo w € IR”. 
Proof. The mapping given by 
P* > (P™, {P#=" | w € R%}) 
for 
aD (Eas Once) 205 2 =(0pxn, | Lp) % 


priw=4 -— Ni (Sake Us Xew) > w € IR" 


is injective. With the help of [A 3.17] one infers the parametrization stated 
where (Id) is equivalent to Me Mpy%. 
It turns out that for realizations $w_i$, $i = 1, \dots, m$, the family of distributions
$P^{z \mid w = w_i}$, $i = 1, \dots, m$, just generates a model (LIFU*) with the assumptions
considered previously. Thus the results on maximum likelihood estimation
obtained there can be made available for the present model.


Theorem 3.3.4 In a model (LIFU-), (IV), (R), (N), (V3) let assumption
(B3) be fulfilled. Then it holds that:

An MLE $\hat{\mathcal{L}}$ of $\mathcal{L}$ based on the continuous density for $(z_{0i})_{i=1,\dots,m}$ almost surely
exists and is almost surely uniquely determined. For a realization $\{z_{0i}\}_{i=1,\dots,m}$, $\hat{\mathcal{L}}$
is given by the MLE in a model (LIFU*), (R), (A), (N), (V3) according to
Theorem 3.3.1 if


Wy = One 
a RU ((wi)imt,...m)') 


(w; = Gi On xr) 201 % = (Onsen: t Lp) cose += 1, ++ +5 M) 


is put there.

Proof. Let P*!”=” be defined for w € IR” according to Lemma 3.3.3; then PY #!?=” 
<r, w € IR” holds. Let py,” be the continuous density; then for the con- 
tinuous densities of 2) and w (with X = 2,,) it holds that 


p(w, t) = pyri (wt) = ph (w) py tr—"(t), — (w, t) € IR" x Re. 


From the above formula one recognizes in which way the marginal density of
$w$ and the conditional density of $J'z$ depend on the parameter $\Sigma$ of the joint
density. According to Lemma 3.3.3, $\Sigma_w$ and $(M, \Sigma_{z.w})$ vary independently in
the model; $\mathcal{L}$ is a function of $M$. Furthermore, we recognize that for realizations


$\{w_i\}_{i=1,\dots,m}$ with $r[(w_i)_{i=1,\dots,m}] = n$, the expression $\prod_{i=1}^m p^{z \mid w = w_i}(J'z_i)$ represents
the continuous density of $Z_1$ in a model (LIFU*), (R), (A), (N), (V3) with
$\mathcal{W} = \mathcal{R}(((w_i)_{i=1,\dots,m})')$. Furthermore, for such $w_i$, $i = 1, \dots, m$, there exists
$\max_{\Sigma_w \in \mathfrak{M}_n^{>}} \prod_{i=1}^m p^{w}(w_i)$. As $r[(w_i)_{i=1,\dots,m}] = n$ almost surely holds, the assertion
follows with Theorem 3.3.1.


Here the MLE in models (LIFU-), (IV), (R), (N) is mainly presented to 
serve as a heuristic basis of estimation by means of instrumental variables in 
models (LIFU*). For a generalization of the model considered here, Robinson 
(1974) obtained the MLE from a heuristically based minimization principle 
and gave an asymptotic treatment, where results of Zellner (1970) and Gold- 
berger (1972) occur as special cases. Related problems were treated by Izenman 
(1975). Here we confine ourselves to stating that (LIFU-), (IV), (R) represents 
a model of independent identically distributed observations, for which general 
conditions can be specified (see [A 2.9]) which ensure consistency, asymptotic 
normality, and asymptotic efficiency of the MLE. 


3.3.4.5 Estimation using instrumental variables in linear functional relations 
with nonrandom unobservable variables 


For determining the instrumental variables estimator in models (LIFU*)
the general connection with models (LIFU-) is now crucial (cf. Section 3.1.5).
We consider a model (LIFU*), (R), (N), (V3). According to Section 3.2.5 the
likelihood function is unbounded here (cf. Lemma 3.3.1, Theorem 3.2.8, and
the remark thereafter); this corresponds to the nonidentifiability of $\mathcal{L}$ in the
model (LIFU-), (R), (N), (V3) (Theorem 3.3.2). We consider a solution of the
estimation problem by means of instrumental variables, analogously to the
case of (LIFU-).

The random IV $w$ is now replaced by a nonrandom matrix $U \in \mathfrak{M}_{n_1 \times m}$
additionally given in the model (LIFU*), called an IV-matrix in the following,
with appropriate properties. Let


$$W := [\,U' \mid M'J^{\perp}\,]', \qquad W \in \mathfrak{M}_{n \times m}, \qquad n := n_1 + p - r.$$


Analogously to the case of LIFU-, the entire known nonrandom matrix $W$ will
function as an IV-matrix in the following (cf. Remark 3.3.11). The nonrandomness
of $U$ assumed here (with random $\zeta$) corresponds to the uncorrelatedness
of $\zeta$ and $u$ required in the case of LIFU-.
In accordance with Theorem 3.3.4 we define the IV-estimator as an MLE according
to Theorem 3.3.1 in a formal model setup (LIFU*), (R), $(A_{\mathcal{W}})$, (N), (V3)
for $\mathcal{W} = \mathcal{R}(W')$.
More precisely, assume that Z follows a distribution model (LIFU*), (R), (N),
(V3). Consider the additional restriction $(A_{\mathcal{W}})$, not present in this model, and
estimate $\mathcal{L}$ by a formal MLE under this restriction, with observations Z.
According to Remark 3.3.3, at first a certain condition is to be imposed on
$n = \dim \mathcal{W}$ to ensure the existence of the so defined IV-estimator. Remark
3.3.13 below then implies the almost sure existence of this estimator as soon
as in the model (LIFU*), (R), $(A_{\mathcal{W}})$, (N), (V3) the MLE according to Theorem
3.3.1 almost surely exists. Then condition (Id) has its counterpart in certain
asymptotic requirements which ensure consistency. These requirements can be
interpreted as conditions of an ‘asymptotically nonvanishing correlation’
between unobservable and observable variables represented by M and W
respectively, and thus they are analogous to condition (Id) in LIFU- (cf.
Remark 3.3.11).


Remark 3.3.12 A formal analogue of (Id) is
$$r[MW'] = p - q. \qquad (105)$$
The indicated procedure is heuristically justified by the fact that the IV-estimator
in (LIFU*) under (V3) is shaped after the MLE resulting from Theorem
3.3.4 for a model (LIFU-), (IV), (R), (N), (V3), (Id). This construction is now
to be carried over to specifications $(V_\nu)$, $\nu = 1, 2$. A justification for this will
be given in Section 3.4, where the asymptotic efficiency in a certain sense of
the obtained estimator is shown under $(V_\nu)$, $\nu = 1, 2, 3$.


Remark 3.3.13 For P%: € (LIFU*), (RB), (N), (V,), Ps € (LIFU*), (R), (Ay), 
(N), (V,), it obviously holds that P# << P%. 


Analogous to Remark 3.3.9 we now drop assumption (N) in the adequate 
model, and summarizing we give the following definition. 


Definition 3.3.1 Let Wy € Quin M2p—q with R(M;)— Wo. Let the 
model (LIFU*) (R),(Aw,), (V,) be adequate for Z,. Let a linear space W © &m,ns 
P—-F SNM, with R(M{)— WKH W, be given, so that condition (B,) is 
satisfied. Let ?(-) be the (almost surely defined) maximum likelihood estimator for 
£ in a model (LIFU*), (R), (Aw), (N), (V,) according to Theorem 3.3.1. The 
estimator which is almost surely defined by 


Lo og P (Zz) 


we call the canonical instrumental variables estimator (CIVE) of $\mathcal{L}$ for the linear
space $\mathcal{W}$. Under $(V_\nu)$, $\nu = 2, 3$, let estimators $\hat\sigma_C^2$ and $\hat\Sigma_C$ of $\sigma^2$ and $\Sigma_\zeta$, respectively,
be defined analogously.


The weighted least squares estimator (WLSE) according to Remark 3.3.9
and the MLE according to Theorem 3.3.1 are special cases for $\mathcal{W} = \mathcal{W}_0$ and
(N), respectively. If (Ex) holds in the adequate model, then, according to
Remark 3.3.10, the CIVE $\hat B_C$ of $B$ is obtained by


$$\hat B_C := \varrho^{-1}(\hat{\mathcal{L}}_C).$$


The linear space $\mathcal{W}$ will be referred to as the IV-space.

In the examples of Section 3.4.3 it will be proved that the simple IV-estimator
(3.3.9) and further ones (BLUE in the linear model, 2SLS-estimators)
result as special cases for certain dimensions $p$, $q$, $r$, $n$.

Besides the CIVE one can consider alternative instrumental variable estimators
resulting in a natural way from alternatives to the MLE in the model
under $(A_{\mathcal{W}})$, (N) (cf. Section 3.3.1). The asymptotic comparison then forms the
main subject of the following Section 3.4, where the alternative estimators
are also discussed in some more detail (Section 3.4.5, Examples 3.4.10—3.4.12).


3.4 Asymptotic theory for linear functional 
relations with nonrandom unobservable variables 
and with independent errors 


3.4.1 Introduction 


This section presents some asymptotic properties of estimators in the model 
LIFU* in the case of independent identically distributed error variables. For 
this we will rely on the approach to estimation by means of instrumental 
variables developed in Section 3.3.4. The most important LIFU* models 
treated in Section 3.1.3 are also included. 

In the asymptotics of LIFU* models it is the crucial feature that in the
general case the problem of an indefinitely increasing number of unknown 
incidental parameters occurs. This distinguishes the present model from more 
common parametric models of mathematical statistics, in particular from the 
model of independent identically distributed observations, as well as from 
the linear model where the incidental parameters (the regressors) are known. 
The question of consistent estimability of the structural parameter, which 
arises in this connection, was treated in Section 3.1.5. There it turned out 
(Theorem 3.1.6) that for the consistent estimability in these models certain
restrictions are necessary with respect to the unknown parameters. 

This will be treated here in some more detail. Sufficient conditions will be 
stated for the consistency of the canonical instrumental variables estimator 
(CIVE) defined in Section 3.3.4. These conditions concern either the distri-
bution model for the errors or the information on the incidental parameters 
(provided by instrumental variables or their generalization to be introduced 
here). 

The main part of this section, however, deals with the problem of efficiency 
of estimators. We ask whether the CIVE is asymptotically efficient against 
the alternative estimators of Section 3.3.1, and in particular against the 2SLS- 
estimator (more precisely against its analogue; see Example 3.4.10). 

Answering this question we must also take into account the specifics of the 
present model, i.e. the possibly indefinitely increasing number of the unknown 
incidental parameters. In a model of independent identically distributed ob- 
servations (under certain regularity assumptions) the MLE is an asymptotically 
efficient estimator (cf. [A 2.9]). Such a statement would also be of importance
here since the MLE under normal distribution is a special case of the CIVE. 

The model treated here fits into the general scheme of independent, not 
necessarily identically distributed observations with a structural parameter to 
be estimated (cf. Remark 3.4.1 below). Under general assumptions, local 
asymptotic normality (cf. [A 2.7]) for a fixed sequence of incidental parameters 
can be proved for such a model, and thus a lower bound can be established
for the limit covariance matrix of asymptotically normal estimators (Ander- 
sen, 1970; Philippow and Roussas, 1973; Ibragimov and Khasminski, 1979). 
But with unknown incidental parameters the MLE attains this bound only 
under restrictive model assumptions (Hoadley, 1971); in general this is the case 
with known incidental parameters. The latter forms the basis for the theory 
of asymptotic efficiency in simultaneous equation models of econometrics (cf. 
Theil, 1971; Schönfeld, 1971) as well as in the linear model (Philippow and
Roussas, 1973; Nussbaum, 1977). But in general the MLE does not attain the 
lower bound in question and the problem of efficiency remains open. 

We propose a solution by considering a special class of asymptotically normal 
estimators (asymptotic Q,-estimators) and investigating optimality within 
this class (in the sense of the covariance matrix of the limiting distribution).
This procedure is similar to that in the linear model, where in certain model- 
specific classes of estimators optimality statements are obtained, also in an 
asymptotic sense, by means of the Gauss-Markov theorem (Bunke and Bunke, 
1986, chap. 2). There is also a relation to the class of asymptotic minimum 
contrast estimators in the case of independent identically distributed obser- 
vations, within which the MLE can easily be obtained as optimal (Michel 
and Pfanzagl, 1971). 

Let us now sketch for introductory purposes the underlying principle in the 
case of the simplest model. We consider the two-dimensional model of a homo- 
geneous linear functional relationship 


$$x_i = \xi_i + \zeta_{1i}, \qquad y_i = \beta\xi_i + \zeta_{2i}, \qquad i = 1, \dots, m \qquad (1)$$


with unknown structural parameter $\beta$, unknown incidental parameters $\xi_i$,
$i = 1, \dots, m$, and independent identically distributed normal error variables
$\zeta_i := (\zeta_{1i}, \zeta_{2i})'$, $i = 1, \dots, m$. According to Section 3.1.5 additional information
on the incidental parameters or the error distribution is necessary for the
consistent estimation of $\beta$. A practically important assumption of this kind
consists in the setup of a ‘model with replicated observations’ (see (3.1.4) and
Example 3.4.7 below)


y= 64+ oy 
Yip Ply Soe Ps, EO) iS ee 


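in which a consistent estimate of $D\zeta$ can be obtained. For simplicity of
presentation let us assume that in model (1) a consistent estimator $\hat\Sigma_m$ of $D\zeta$ (of
the ANOVA type) is given. Then for the statistic


$$Q_m := m^{-1}\begin{pmatrix} \sum_{i=1}^m x_i^2 & \sum_{i=1}^m x_i y_i \\ \sum_{i=1}^m x_i y_i & \sum_{i=1}^m y_i^2 \end{pmatrix} - \hat\Sigma_m$$

under the assumption $m^{-1}\sum_{i=1}^m \xi_i^2 \to h > 0$ as $m \to \infty$ we have the relation

$$Q_m \xrightarrow{P} h\begin{pmatrix} 1 & \beta \\ \beta & \beta^2 \end{pmatrix} \quad (m \to \infty). \qquad (2)$$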

On the basis of this relation various consistent estimators can be constructed. 
But it is not clear which of these should be preferred. We have seen that the theory 
of the MLE cannot answer this question.
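The ANOVA-type estimator of the error covariance assumed above can be sketched for the model with replicated observations. The following is a minimal illustration, assuming k replicates per incidental parameter and a pooled within-group covariance as the estimator; the function name, the choice k = 2 and the simulated values are assumptions made for the example, not the construction used in the text.

```python
import random


def within_group_covariance(groups):
    """Pooled within-group covariance of 2-vectors.

    groups: list of groups, each a list of (x, y) replicates that share the
    same unobserved mean. Returns a 2x2 matrix as nested lists.
    """
    s = [[0.0, 0.0], [0.0, 0.0]]
    dof = 0
    for g in groups:
        k = len(g)
        mx = sum(p[0] for p in g) / k
        my = sum(p[1] for p in g) / k
        for x, y in g:
            dx, dy = x - mx, y - my
            s[0][0] += dx * dx
            s[0][1] += dx * dy
            s[1][1] += dy * dy
        dof += k - 1  # each group loses one degree of freedom for its mean
    s[1][0] = s[0][1]
    return [[v / dof for v in row] for row in s]


rng = random.Random(0)
beta, m, k = 1.5, 5000, 2
sd1, sd2 = 1.0, 0.5  # error standard deviations, independent components

groups = []
for i in range(m):
    xi = 1.0 + (i % 3)  # nonrandom incidental parameter
    groups.append(
        [(xi + rng.gauss(0.0, sd1), beta * xi + rng.gauss(0.0, sd2))
         for _ in range(k)]
    )

sigma_hat = within_group_covariance(groups)
```

The group means cancel in the within-group deviations, so the estimate does not depend on the incidental parameters, which is what makes this type of estimator usable when their number grows with m.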
By $\bar A$ we denote the vector $(a_{11}, a_{12}, a_{22})$ for a symmetric $(2 \times 2)$ matrix
$A = ((a_{ij}))_{i,j=1,2}$. Then (2) can be written


$$\bar Q_m \xrightarrow{P} (h, h\beta, h\beta^2) \quad (m \to \infty).$$


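The set $\mathcal{A} := \{(h, h\beta, h\beta^2) \mid h > 0,\ \beta \in \mathbb{R}\}$ forms a surface in $\mathbb{R}^3$. Let us consider
real-valued functions $f$ defined on an open subset $\mathcal{A}^*$ of $\mathbb{R}^3$ containing $\mathcal{A}$, with
the property


$$f((h, h\beta, h\beta^2)) = \beta \quad \forall\, h > 0,\ \beta \in \mathbb{R}. \qquad (3)$$

If $f$ is continuous on $\mathcal{A}^*$, then $f$ provides a consistent estimator $\hat\beta_m$ of $\beta$:


$$\hat\beta_m := f(\bar Q_m).$$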


A function of this kind is given, for example, by


$$f(x) = x_2 / x_1, \qquad \mathcal{A}^* = \{x = (x_j)_{j=1,2,3} \mid x_1 \ne 0\}.$$
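The consistency argument based on relation (2) and a function f with property (3) can be checked by simulation. The following is a minimal sketch, assuming the simple model (1) with independent normal errors, the choice f(x) = x_2/x_1, and, for simplicity, the true error covariance in place of a consistent estimator; the parameter values and function names are illustrative.

```python
import random


def q_statistic(xs, ys, sigma):
    """m^{-1} * (sum of z_i z_i') minus the error covariance, for z_i = (x_i, y_i)'."""
    m = len(xs)
    sxx = sum(x * x for x in xs) / m
    sxy = sum(x * y for x, y in zip(xs, ys)) / m
    syy = sum(y * y for y in ys) / m
    return [
        [sxx - sigma[0][0], sxy - sigma[0][1]],
        [sxy - sigma[1][0], syy - sigma[1][1]],
    ]


def f(q_bar):
    """f(x) = x_2 / x_1 on {x_1 != 0}; satisfies f((h, h*b, h*b^2)) = b."""
    return q_bar[1] / q_bar[0]


rng = random.Random(1)
beta, m = 1.5, 20000
sigma = [[0.25, 0.0], [0.0, 0.25]]            # assumed known error covariance
xis = [1.0 + (i % 3) for i in range(m)]        # nonrandom incidental parameters
xs = [xi + rng.gauss(0.0, 0.5) for xi in xis]
ys = [beta * xi + rng.gauss(0.0, 0.5) for xi in xis]

Q = q_statistic(xs, ys, sigma)
q_bar = (Q[0][0], Q[0][1], Q[1][1])            # vector of the distinct entries
beta_hat = f(q_bar)
h_m = sum(xi * xi for xi in xis) / m           # finite-sample analogue of h
```

Other admissible choices such as x_3/x_2 also satisfy (3) on a suitable domain, which is exactly the ambiguity that motivates the efficiency comparison.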


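Now, as a simple sample function, $\bar Q_m$ is asymptotically normal:


$$\mathcal{L}\{m^{1/2}(\bar Q_m - (h, h\beta, h\beta^2))\} \to N_3(0_3, \Lambda).$$


If $f$ is continuously differentiable on $\mathcal{A}^*$ with derivative $df \in \mathbb{R}^3$, then for $\hat\beta_m$
this implies


$$\hat\beta_m - \beta = df((h, h\beta, h\beta^2))\,\bar Q_m + o_P(m^{-1/2}). \qquad (4)$$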


Differentiating (3) with respect to $h$ and $\beta$, one obtains linear restrictions
for $df$:
$$df((h, h\beta, h\beta^2))\,A = J_0 \qquad (5)$$


for certain parameter-dependent matrices $A$, $J_0$. Let us consider general
estimators $\hat\beta_m$ which satisfy (4) and (5) for an arbitrary nonrandom,
parameter-dependent $C$ in place of $df((h, h\beta, h\beta^2))$ (asymptotic $Q_m$-estimators). These include
in particular the MLE. For estimators of this kind it obviously holds that
$$\mathcal{L}\{m^{1/2}(\hat\beta_m - \beta)\} \to N(0, C\Lambda C').$$

A minimization of $C\Lambda C'$ subject to $CA = J_0$ as in the theorem of Gauss-Markov
yields a lower bound for the limiting covariance matrix. It turns out that the
MLE attains this bound; hence it is asymptotically efficient within the class
of asymptotic $Q_m$-estimators. In particular, it dominates some alternatives
considered in the literature (modified 2SLS-estimators, minimum-contrast
estimators based on $Q_m$).
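For the simple model (1), the restriction (5) and the resulting bound can be written out explicitly. The following is a worked sketch rather than the derivation of the text: the explicit forms of $A$ and $J_0$ are obtained here by differentiating (3) directly, the limiting covariance matrix of the sample function is written $\Lambda$ to distinguish it from $A$, and the closed-form minimizer assumes that $\Lambda$ is nonsingular and that $A$ has full column rank.

```latex
% Differentiating f((h, h\beta, h\beta^2)) = \beta with respect to h and \beta,
% with \theta := (h, h\beta, h\beta^2):
\[
  df(\theta)\begin{pmatrix} 1 \\ \beta \\ \beta^2 \end{pmatrix} = 0, \qquad
  df(\theta)\begin{pmatrix} 0 \\ h \\ 2h\beta \end{pmatrix} = 1,
\]
% so that (5) holds with
\[
  A = \begin{pmatrix} 1 & 0 \\ \beta & h \\ \beta^2 & 2h\beta \end{pmatrix}, \qquad
  J_0 = (0,\ 1).
\]
% Check for f(x) = x_2 / x_1: df(\theta) = (-\beta/h,\ 1/h,\ 0) satisfies both restrictions.
%
% Minimizing C \Lambda C' subject to C A = J_0 (Lagrange multipliers) gives
\[
  C^{*} = J_0 (A'\Lambda^{-1}A)^{-1} A'\Lambda^{-1}, \qquad
  C^{*}\Lambda C^{*\prime} = J_0 (A'\Lambda^{-1}A)^{-1} J_0' .
\]
% For any other C with C A = J_0 the cross term vanishes, since
% (C - C^{*}) \Lambda C^{*\prime} = (CA - C^{*}A)(A'\Lambda^{-1}A)^{-1} J_0' = 0, hence
\[
  C\Lambda C' = C^{*}\Lambda C^{*\prime} + (C - C^{*})\Lambda (C - C^{*})'
  \;\ge\; C^{*}\Lambda C^{*\prime}.
\]
```

The first restriction also shows why only $\bar Q_m$ appears in (4): $df(\theta)\theta = 0$, so centering $\bar Q_m$ at $\theta$ does not change the linear term.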

The following investigations concern a general model (LIFU*) with instrumental
variables according to Section 3.3.4 which comprises a number of
variants. We shall develop a general version of the procedure outlined (Section
3.4.5). Accordingly, we fix a distribution model $\{P_\vartheta \mid \vartheta \in \Theta\}$ for the infinite
sequence of observations $\{z_i\}_{i\in\mathbb{N}}$ so that $Z_m := (z_i)_{i=1,\dots,m}$ for $m \ge m_0$ obeys
a model (LIFU*).

For given $p, q, r \in \mathbb{N}$, $p \ge r \ge q$, $p > q$, define in case $r < p$


die (eax ls Jt = [eer Onepenl 
and for r = p, 
oh awes J ere) 
and let J = RJ). Let Py be a set of probability distributions P over [IR?, 8?] 
with 
J Pie hig = a (AP 0b, ey wl ered tse 
where J’P is the image of P under the linear mapping J’ : IR? > R’. 
Let also be given: 


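— an $m_0 \in \mathbb{N}$, a sequence $\{n(m)\}_{m \ge m_0}$ of natural numbers, and an $\alpha \in [0, 1]$
with


$$n(m)\,m^{-1} \to \alpha \quad (m \to \infty), \qquad n(m) \ge p - q \ \text{ for } m \ge m_0$$


— a sequence $\{\mathcal{W}_m\}_{m \ge m_0}$ of linear spaces with
$\mathcal{W}_m \in \mathfrak{L}_{m,n(m)}$, $m \ge m_0$
— a set $\mathcal{V} \subseteq \{\Sigma \in \mathfrak{M}_p^{\ge} \mid \mathcal{R}(\Sigma) = \mathcal{J}\}$.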
Let 
PF := {212P | VEV,P€ Pp}, 


and let PS denote the countably infinite product of an element P of ?é with 
itself. Let 


Me r= {{wiien |ui€ R?,1€N, AS € 24,4, € M2: 
£+ FT =R?, R(wi)ins,..m) = £, A(((Miins,...m) I+) Vn, 


and 

B= (P| {udion), PPE P, (uiiew € M”. 
Let {P5, 6 € O} be given by 

tio = dion + Siiewn ~ Po, 

Siiew ~ (P°)%, 

O == TIE 




We consider the problem of estimating the parameter $\mathcal{L}$ on the basis of observations
$\{z_i\}_{i=1,\dots,m}$ for $m \to \infty$.

This model induces a distribution model (LIFU*), (R) according to Section
3.3.4 for each of the random matrices $Z_m$, $m \ge m_0$; the stated model is the
analogue of (LIFU*), (R) for an increasing number of observations $m$. The
increasing dimension of $\mathcal{W}_m$ is admitted in order to allow a unified asymptotic
treatment of the model including the case $\mathcal{W}_m = \mathbb{R}^m$, i.e. the case of a model
without instrumental variables. In addition, this assumption allows us to
cover further interesting special cases, like the one of an increasing number of
groups in the ‘model with replicated observations’ (cf. (3.2.1), Example 3.4.7).
In Section 3.4.3 we will provide more examples which demonstrate the usefulness
of this general model. The model can be regarded as one with generalized
nonrandom instrumental variables. Besides nonrandomness, the generalization
consists in admitting $\dim \mathcal{W}_m \to \infty$.

We consider some further assumptions about $\mathcal{P}^\zeta$ and $\{u_i\}_{i\in\mathbb{N}}$. Let


Oo 


geres 


(N) Po = {N(0p; Pa) 
(V) v=, 

with V, from Section 3.3.4, (iv), »y = 1, 2,3 
(Ex) £ + Jo = R* 

(Jo = R [0 p-q) xq | Zq]), cf. Section 3.3.4) 
(A) MM NSW. m = mM 


If (Ex) or (A) is assumed, we speak of an explicit or adequate case,
respectively.

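(C1) $r[H] = p - q$

(C2) $m^{-1} M_m M_m' \xrightarrow[m\to\infty]{} H_0$

(C3) (a) $m - n(m) \xrightarrow[m\to\infty]{} \infty$

(b) $(m - n(m))^{-1} M_m P_{\mathcal{W}_m^{\perp}} M_m' \xrightarrow[m\to\infty]{} H_\Delta$

(c) $r[H - \alpha H_\Delta] = p - q$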


(C4) Rie Ne meg 0] 
(C5) $\max_{1 \le i \le m} m^{-1} \|M_m P_{\mathcal{W}_m} e_i^{(m)}\|^2 \xrightarrow[m\to\infty]{} 0$.


Here $e_i^{(m)}$ denotes the $i$th unit vector in $\mathbb{R}^m$ and $1_m := (1, \dots, 1)' \in \mathbb{R}^m$.
All statements of Section 3.4 refer to this model, with a parameter space which is
restricted or specified from $\mathcal{P}^{\varepsilon} \times \mathcal{M}^0$ by the current assumptions. We denote
this model by $\{P_\vartheta \mid \vartheta \in \Theta\}$, where $\Theta$ represents the parameter space in question.
As in Section 3.3.4 (cf. Remark 3.3.4) we do not distinguish between
$\{P_\vartheta \mid \vartheta \in \Theta\}$ and the induced model for $\{J' z_i\}_{i\in\mathbb{N}}$ with given $\{J^{\perp\prime} z_i\}_{i\in\mathbb{N}}$.
Furthermore, we proceed according to Remark 3.3.6.

Let us now fix some notation occurring throughout Section 3.4. 


(i) We use the notation of Section 3.3.4, partly supplied with an index
$m$; in particular,


Ge ¥' ) (y) (r) *(v) iy(r) 
Gate Lita Cm? Oa? Qom ? E,,; Eom? Mim, Mom, 
Lom) Ocm> Lens Sey Sir dare ie 2, 3 


and parameters $B$, $B_1$, $B_2$, $\Sigma_\varepsilon$, $\sigma_\varepsilon^2$.
(ii) Let 


1g) = (dR) 
for T°?) according to Section 3.3.4 (xii), » = 1, 2, 3 
(iii) Let 
O” := OF + n(m) m (63, — o;) LT o(r), ) =o 


with X from Section 3.3.4, (iv); 6%,, 18 defined in Theorem 3.4.3 below. 
Let 


me = Vim + r(m) m\(Gom — 07) I'ZII (rv), v = 1,2,3. 
Then analogously to Lemma 3.3.2 we verify 
QO = Pp (J4m8y, JY + IQS’) F 
(iv) Under (Ex) let V,, be defined by 
WM, sph p 3 
This uniquely determines V,,. Let 
Nore pet OGG) Nine 
Then N,,, = U,,, holds. 
(v) Let 
$H_m := m^{-1} M_m P_{\mathcal{W}_m} M_m'$, $\quad H_{0m} := m^{-1} M_m M_m'$,

$H_{\Delta m} := (m - n(m))^{-1} M_m P_{\mathcal{W}_m^{\perp}} M_m'$.

(vi) Let
re eae 2 aan ofl s(v), G° := A, — n(m) m= Aanl g(r), 
PSS AB TA Sie 

(vii) Let 
GP =H, »=1,2, GP := (1—«) A + off. 


(viii) Let A be one of the matrices defined under (v), (vi), and (vii). Then 


~ 


R(A) S # is always true. This analogously holds for the limits for m > oo. 
Thus, under (Ex) there exists an A € Mt,_, with A = [gAL’, and A is 
uniquely determined. Accordingly let H, H,, Hy, Hom, Ha, Ham G”, 
Go), Gy”, v = 1, 2, 3 be determined by 
A= LAatl,, © H,,'= LpH,,by, ete. 
and g by g = Lzg. 
(ix) If AG”) = F holds, then according to [A 1.5] there is a representation 
GO = Paw (P5314 Py. + IGS’) Fi) 


with uniquely determined E™ EM ry (pr, GEM, As JUG I+ 
= lim m18y,, » = 1, 2,3, and since R(H4) S F (because of R(M;p) 


m—->oo 


CW ,,), We infer, as in Lemma 3.3.2, that #) is independent of ». Let 
B= &", tie Ber pas 
(x) Let 
8 := 25, »=1,2, 8:=5,4 A, 
with from Section 3.3.4 (iv). Let 
By := (Pg. + [Sy }?) Fe, vy = 1,2, 3. 
(xi) Let 
Th TAA yt oY 
for L! € Myyqs R(L+) = #2. 
(xii) Let 
M(m) := m — n(m) Ly3)(r), V1 $2;'3:. 


(xiii) Let I’, be the projection operator in R* onto the linear subspace {A| A 
€ M,}. We have 


y= Fle tI), = sle+ V2 


(for definition and properties of the matrices [{s,1, I{s; see [A 1.10]). 


Let f; € Nee x8(s+1)/2 be such that
r= 1 hee 1 ai Lenny 
(xiv) In some examples we additionally use the following notation:


Xoq i= (O(r-a) x (p=) ier iQ ) Zn 


eh) CI 


Y,, := (0 Te) Leg 


qx (p—q) 
5S (ahs | Xcel 
Then we have 
Leg Mint dent aly | Lew | Xam Yl: 
Hor A =O Fea) 
Toy t= iy, Lk = 0s. 


(xv) Let 
®, = HO ® OS’ 
We = EGC’ & 66’. 


For the sake of clarity, some further expressions depending on the parameter
$\vartheta$ will be indexed by $\vartheta$. As in Section 3.3.4 we will frequently suppress the
index $\nu$.


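Remark 3.4.1 Consider a distribution model for a sequence of independent
observations $\{z_j\}_{j\in\mathbb{N}}$:


$\{(P_{\gamma,\eta_j})_{j\in\mathbb{N}} \mid (\gamma, (\eta_j)_{j\in\mathbb{N}}) \in \Gamma \times F^{\mathbb{N}}\}$,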


where $\eta_j$ is identifiable in the distribution model for $z_j$, for $j \in \mathbb{N}$. There,
the parameter $\gamma$ is called the structural parameter and the parameters $\eta_j$,
$j \in \mathbb{N}$, incidental parameters. Accordingly, in the present model $\mathcal{P}^{\varepsilon}$ is a
structural parameter and $\mu_i$, $i \in \mathbb{N}$, are incidental parameters. As $p - r$
components of $\mu_i$ are known, $\mu_i$ has $r - q$ functionally independent components;
thus we recognize infinitely many unknown incidental parameters in the
model $\{P_\vartheta \mid \vartheta \in \Theta\}$. As mentioned before, this is the specific feature of the
model with respect to asymptotic theory. Under the restriction (A) it persists
in general: then the number of unknown functionally independent components
of $(\mu_i)_{i=1,\dots,m}$ equals $(r - q)\, n(m)$ and thus is allowed to increase to infinity;
the convergence assumptions do not restrict $(\mu_i)_{i=1,\dots,m}$. Only in the case of
constant $n(m)$ under (A) does a model with finite-dimensional parameter
space result (cf. Example 3.4.3). General models with infinitely many unknown
incidental parameters have been considered by Wald (1948), Neyman and
Scott (1948), Andersen (1970a, b), and Pfanzagl (1970).
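
As a purely illustrative aside on why infinitely many incidental parameters matter, the following sketch simulates the classical paired-observations setting usually associated with Neyman and Scott. It is not the model of Section 3.4.1; the function name, the sample sizes and the way the incidental means are generated are assumptions made for this illustration only. Each pair has its own unknown mean (incidental parameter), while the variance plays the role of the structural parameter; the maximum likelihood estimator of the variance then settles near half the true value.

```python
import random


def neyman_scott_mle(n_pairs, sigma2=4.0, seed=0):
    """MLE of sigma2 from n_pairs pairs y_i1, y_i2 ~ N(mu_i, sigma2).

    Hypothetical illustration: every pair has its own unknown mean mu_i
    (incidental parameter); sigma2 is the structural parameter.
    """
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    total = 0.0
    for _ in range(n_pairs):
        mu_i = 10.0 * rng.random()      # arbitrary incidental parameter
        y1 = rng.gauss(mu_i, sigma)
        y2 = rng.gauss(mu_i, sigma)
        mean_i = 0.5 * (y1 + y2)        # MLE of mu_i within the pair
        total += (y1 - mean_i) ** 2 + (y2 - mean_i) ** 2
    return total / (2 * n_pairs)        # MLE of sigma2 over all 2 * n_pairs values


if __name__ == "__main__":
    # With sigma2 = 4 the estimate stays near sigma2 / 2 = 2, not near 4.
    print(neyman_scott_mle(20000))
```

The number of nuisance means grows with the number of pairs, which is the same structural feature (on a much smaller scale) as the growing number $(r - q)\, n(m)$ of incidental components discussed in the remark.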

In a suitable parametrization one obtains $(\mathcal{P}^{\varepsilon}, \mathcal{L})$ as a structural parameter;
under (V$_\nu$), $\nu = 2, 3$, we consider the problem of estimating the structural
parameter $(\sigma_\varepsilon^2, \mathcal{L})$ or $(\Sigma_\varepsilon, \mathcal{L})$, respectively.

A sequence of estimators $\{T_m\}_{m \ge m_0} = \{T_m(Z_m)\}_{m \ge m_0}$ will be referred to in
short as an estimator in the sequel.

We see that under the specification (V$_1$) the canonical instrumental variables
estimator (CIVE) $\hat{\mathcal{L}}_{Cm}$ according to Definition 3.3.1 is defined for $m \ge m_0$.
For sufficiently large $m$ also (B$_2$) (Section 3.3.4 (xiii)) is satisfied, so that also
under (V$_2$) the estimator $(\hat\sigma^2_{Cm}, \hat{\mathcal{L}}_{Cm})$ of $(\sigma_\varepsilon^2, \mathcal{L})$ is defined. Assume this holds
for $m \ge m_0$. Under (V$_3$) condition (B$_3$) is needed. Now, if (C3) (a) holds, then
(B$_3$) is satisfied for sufficiently large $m$. We see that under (V$_3$), (C3) (a) the
estimator $(\hat\Sigma_{Cm}, \hat{\mathcal{L}}_{Cm})$ of $(\Sigma_\varepsilon, \mathcal{L})$ is defined for $m \ge m_0$.


3.4.2 Consistency 


In this section consistency of the CIVE of the structural parameter is to be
proved. This will be carried out for a general specification (V$_\nu$), $\nu = 1, 2, 3$,
i.e. for the case of a not necessarily normal distribution model $\mathcal{P}^{\varepsilon}$ for the error
variable $\varepsilon$. The problem of consistency of the estimators $\hat\sigma^2_{Cm}$ and $\hat\Sigma_{Cm}$ is
investigated as well; here in general inconsistency obtains. In the adequate case
the estimators can be modified to be consistent, or consistent estimators can
be found.

After this the meaning of the assumptions made in each case to prove the
consistency of the CIVE is discussed. These represent restrictions on the parameter
sequence $\{\mu_i\}_{i\in\mathbb{N}}$ which are necessary depending on the error distribution
model (cf. Section 3.1.5). These assumptions can be construed as analogues of
the correlation condition (Id) in the case of a random IV model (LIFU_). First
we show consistency of the CIVE $\hat{\mathcal{L}}_{Cm}$ under (V$_\nu$), $\nu = 1, 2, 3$. For this we need
the following lemma.


ML mPm Li, — Eom Link LBin 


wl 


Lemma 3.4.1 Let {Pinbmsm, be a sequence of projectors Pm € M=, m= mM. 
= ME mPmMn + MnPnbm) + Opn (tr [Prn]) 


Then 
Proof. Since 
M1 LinP mL oa Eym Zink mL m = m*USmPmM, As Nie ee) 
ot Nome dees ae ra Bye 0., Pas: 
it suffices to show that 


DS nPmbn = O(m- tr [Pnl)
or
Da'm§ ,Pybnt = Om? tr [Pp}) 


uniformly for a € {@ € R? | ||z|| = 1}. Now a’S,, is a vector of independent 
identically distributed random variables a’6;,7 = 1,...,m. We have 


Ea'5 = 0, DoS = aac 


Pay = Ha'So'a © ao'a = (a' & a’) Pela @ a). 
Thus, according to [A 3.18], 


Da'mSyPnbaa = m-* Y ph{(a’ ® a’) yela @ a) — 3(a’E;a)?) 


w=1 
+ 2m-? tr [Pn] (a’ 2,0)? 
for 
dap = ((pis) i= eee ac 


peres 


The assertion follows with 
m m 
m-* >) pis Sm? Y pix = m tr [P,]. 
i=1 i=1 
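
The final inequality of the proof rests on a standard fact: the diagonal entries $p_{ii}$ of an orthogonal projector lie in $[0, 1]$, so that $\sum_i p_{ii}^2 \le \sum_i p_{ii} = \operatorname{tr}[P_m]$. The sketch below checks this numerically for the simplest case, a projector onto the span of one vector; the helper names are hypothetical and chosen only for this illustration.

```python
def projector_onto_vector(w):
    """Orthogonal projector onto span{w} for a nonzero vector w (list of floats)."""
    norm2 = sum(x * x for x in w)
    if norm2 == 0:
        raise ValueError("w must be nonzero")
    return [[wi * wj / norm2 for wj in w] for wi in w]


def diagonal_bound(P):
    """Return (sum of squared diagonal entries, trace) of a square matrix P."""
    diag = [P[i][i] for i in range(len(P))]
    return sum(d * d for d in diag), sum(diag)


if __name__ == "__main__":
    P = projector_onto_vector([1.0, 2.0, 3.0, 4.0])
    sum_sq, trace = diagonal_bound(P)
    # For a rank-one projector the trace is 1 and the sum of squares is smaller.
    print(sum_sq, trace)
```

For projectors of higher rank the same bound holds entry by entry, which is all that the proof uses.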
This lemma yields convergence statements for the matrices $Q_{Cm}$ under the
respective model assumptions.


Lemma 3.4.2 Under (V,), » = 1, 2, 


mF Ge. 


m—->co 


Proof. Observe that 
LOon = Hg 
According to Lemma 3.4.1 it suffices to show that 
mE mPy,Mn + MnPw,Sm) = or(1)- 
But this immediately follows from 
Lyme mPo,Mn =%pxp> — Dobm = Im @ Xs 
Dyn 3h ,,P yp, My, = mA, © Z, = o(1) 
and the model assumption 
A,— > HH." 


m= m—-oo 


In the case of the model assumption (V$_3$), the conditions (C3) are needed.
For this we prove the following lemma.
Lemma 3.4.3 Under (V$_3$), (C3) (a), (b), we have


$S_m^{(3)} \xrightarrow[m\to\infty]{P} H_\Delta + \Sigma_\varepsilon$.


Proof. In Lemma 3.4.2, replace $\mathcal{W}_m$, $m$, $m^{-1} M_m P_{\mathcal{W}_m} M_m' \to H$ and $m \to \infty$
by $\mathcal{W}_m^{\perp}$, $m - n(m)$, (C3) (b) and (C3) (a), respectively.
The following lemma is an immediate consequence.
Lemma 3.4.4 Under (V$_3$), (C3) (a), (b),


$Q_{Cm}^{(3)} \xrightarrow[m\to\infty]{P} G^{(3)}$.


The following lemma immediately prepares the proof of the consistency of
$\hat{\mathcal{L}}_{Cm}$:

Lemma 3.4.5 Under (V,), (C1), v = 1, 2, or under (V3), (C3), we have 
(2) Om Par G, MG) =F 


(b) E,, ++ E 
(c) Eom > Ey, Ey € Mi xp: 


Proof. (a) The convergence holds due to Lemmas 3.4.2 and 3.4.4. As@ = lim Gn, 


m—>co 
LG, = Ogxm> mm, holds for Le M,x4, AL) = £1, we obtain 
L'G = 05.m and thus RG) S £. The assumptions (C1) or (C3) entail 7[@] 
= p — q; this implies the assertion. 


(b) Part (a) implies J+’OomJ+ —s J’GJ+ € M>_,; since by virtue of [A 1.5], 


m—* 


Lemma 3.3.2, and (ix) from Section 3.4.1, we have 
By = J’ QomI* (I QomI*) 1, B= SEIMIVGS!)4, 
the assertion follows. 


(c) According to Lemma 3.4.3, S,, + So; because of R(A4) S I (which is a 


consequence of 2(M’,J+) © W,,), we have R(So) = J. As a result [Sj]? 
—> [Sj }?2, RSP ]}/2) = J, and from the definition of Ey, (Theorem 3.3.1 (b)) 
the assertion follows. @ 


With that we are now able to prove the consistency of the CIVE $\hat{\mathcal{L}}_{Cm}$ under
the specifications (V$_\nu$), $\nu = 1, 2, 3$. Observe that the notion of convergence
in probability on the set $\mathfrak{L}_{p,p-q}$ is well defined by the pertaining structure of a
differentiable manifold (cf. [A 3.16]).


Theorem 3.4.1 Under (V$_\nu$), (C1), $\nu = 1, 2$, or under (V$_3$), (C3), for the CIVE
$\{\hat{\mathcal{L}}_{Cm}\}_{m \ge m_0}$ of the structural parameter $\mathcal{L}$, we have


$\hat{\mathcal{L}}_{Cm} \xrightarrow[m\to\infty]{P} \mathcal{L}$.


Proof. Lemma 3.4.5 yields
EonQonEom ae HGH, € MexD- 


Let (Ai,m)i=1,...99 41.m SS +++ SAp,m be the ordered p-tuple of the eigenvalues 
of Bom QomB on: Let 


Am = {Zn € Mom | Ae mm — = Aninlm 'Sy,,_,]} > 


where (A; m)i—1,....m> Mim are understood as functions of Z,,. 
Lemma 3.4.5 implies 


m'Sy,, wa? TU GT* € M5. 


m—* 


and then according to [A 1.1], gis —_+ 0. It follows that P;(4A,) ==> 0 


™- 


hence, by virtue of Corollary 3.3.1,
PI {BomLom = "p,p-q (BomQomEom)}) a 0. 


The above implies H)GH, € M!?—” (see [A 1.2]); since M!?” is open in M, 
and 1p,p-q is continuous on MP ([A 1.4]), it follows that
Np.p-q{Hom@omEom) ao E,f£: From the continuity of the mapping 

Ae Ket Ad.) Aneta 2 ie Ur Pate 


(see [A 3.16d]), it follows that 


Eq np,p-q(EomQomLom) a AG 


m—>oo 


This implies the assertion.


For the limiting behaviour of the matrix JT, the following statement can 
be shown. 


Lemma 3.4.6 

(a) Under ee (Cl), 
1h es a oll 

(b) Under bad (C3), 


A 


IE 


Se 7 


Proof. (a) Since the mapping f£ ++ is continuous on &,,, and 
{£ € Bap-¢ | £ + J = R®} is open in &,,-, (parts (c) and (f) of [A 3.16}), 
Theorem 3.4.1 implies 74, ——+ £ aiee to Beet ) of a oe there exist 
Ls} mzxmgs E+ such that Di, D1¢M,.,, 2 PLR = £1 with 
LD} + I+. In accordance with Section 3.4.1 a ae ces - : 4 (xii), let 


m m—0co 
TT, = L2(L2/ E74) LA! = EA (E2 ELA) D2’. 


Then
1, *—+ o2L'(LY EL!) LY = o3ll. 


(b) 
due to Lemma. 3.4.3, and of R(H4) Cf. w 


which holds 


Now we consider estimation of the structural parameter $\sigma_\varepsilon^2$ under the
specification (V$_2$).


Theorem 3.4.2 Under (V$_2$), (C1), (C2), for the estimator $\{\hat\sigma^2_{Cm}\}_{m \ge m_0}$ of the
structural parameter $\sigma_\varepsilon^2$ we have


6% + (1 —a + r-lgx) of + 1-1) tr [Z*(Ay — A). 


Cm mae 


Proof. Using (C2) one shows analogously to Lemma 3.4.2 that 


m Sz» Sone 


m Fis 


Then Lemmas 3.4.2 and 3.4.6 imply 
MUTE 7,0 oar WA + «%:). 


m m—>co 


Now 7H = Onxp holds; from the form of 6%, according to Theorem 3.3.1 (d) 
and from tr [J7X,] = dim 2?”£1 = dim J'f+ = q (which holds due to part 
(a) of [A 1.6]), it follows that 


m-* tr [I nQz,.W_] a> %929 (6) 
and hence the assertion. 


This means that the estimator $\hat\sigma^2_{Cm}$ for $\sigma_\varepsilon^2$ is generally inconsistent. This also
holds in the adequate case; indeed, then (A) and the definitions of $H_m$ and
$H_{0m}$ only imply that $H_m = H_{0m}$, $H = H_0$. However, in the adequate case
consistency can be obtained by modifying $\hat\sigma^2_{Cm}$: the estimator


Bin = (1 — a + 9 Mgay 1 Bb 


is consistent for $\sigma_\varepsilon^2$ under (V$_2$), (A), (C1). Furthermore, in the general (not
necessarily adequate) case, provided that $\alpha \ne 0$, we can immediately find a
consistent estimator on the basis of (6): the estimator


Gm = (omg)? tr [TmQz,,.0 5] (7) 


is consistent for $\sigma_\varepsilon^2$ under (V$_2$), (C1), $\alpha \ne 0$. This estimator is not defined for
$\alpha = 0$. Below we give an estimator which is consistent also in this case,
provided condition (C2) is met.


Theorem 3.4.3 Under (V$_2$), (C1), (C2) for the estimator
$\hat\sigma^2_{0m} = (mq)^{-1} \operatorname{tr}[\hat\Pi_m Q_{Z_m}]$
of the structural parameter $\sigma_\varepsilon^2$ we have


$\hat\sigma^2_{0m} \xrightarrow[m\to\infty]{P} \sigma_\varepsilon^2$.


Proof. Analogously to Lemma 3.4.2 one shows, using (C2), that


$m^{-1} Q_{Z_m} \xrightarrow[m\to\infty]{P} H_0 + \Sigma_\varepsilon$.
The definition of $\mathcal{M}^0$ implies $\mathcal{R}(H_0) \subseteq \mathcal{L}$ and thus $\Pi H_0 = 0_{p\times p}$. From this with
Lemma 3.4.6 (a) the assertion follows.


The question of efficiency of the estimators for $\sigma_\varepsilon^2$ and $\Sigma_\varepsilon$ (under (V$_3$)) will
not be treated here; the consistent estimator $\hat\sigma^2_{0m}$ is needed in the following
to construct the class of estimators to be compared with $\hat{\mathcal{L}}_{Cm}$. Now, the
counterpart of Lemma 3.4.5 (a) for $Q_m$ is as follows.


Lemma 3.4.7 Under (V$_1$), (C1), or under (V$_2$), (C1), (C2), or under (V$_3$), (C3),
we have


OSG SHOALS 


m-—->oco 


Proof. The proof is obtained immediately from Lemma 3.4.5, Theorem 3.4.3,
and from the definition of $Q_m$.


Now we consider estimation of the structural parameter $\Sigma_\varepsilon$ under (V$_3$).


Theorem 3.4.4 Under (V$_3$), (C3) for the estimator $\{\hat\Sigma_{Cm}\}_{m \ge m_0}$ of the structural
parameter $\Sigma_\varepsilon$, we have


Lom — + (1 —a)(H4+2;)+ oz! P silt gi 02! 


CO 


Proof. Lemmas 3.4.4 and 3.4.6 (b) imply 


1, Qomll np ——> IGT’ = 0,» 


m—>co 


and with Lemma 3.4.3 one obtains 
SIT »n(m) m1 Sy es od llX;, = ake! Prt ys de!. 


From this and from the form of Yen, according to Theorem 3.3.1 (d) the asser- 
tion follows. @ 


Thus the estimator $\hat\Sigma_{Cm}$ is in general not consistent; in the adequate case
one obtains consistency if $\alpha = 0$. The estimator $S_m^{(3)}$ is consistent for $\Sigma_\varepsilon$ in
the adequate case, according to Lemma 3.4.3.


Remark 3.4.2 Let us consider the explicit case, i.e. the case of a model
restricted by the assumption (Ex). Then under the assumptions of Theorem 3.4.1
for the CIVE $\hat B_{Cm}$ of the structural parameter $B$ we have


$\hat B_{Cm} \xrightarrow[m\to\infty]{P} B$


because of the form of $\hat B_{Cm}$ according to Remark 3.3.10 and the continuity
of $o^{-1}$ (see [A 3.16 (a)]).


Now let us interpret the obtained results on consistency as well as the
assumptions. In particular, we wish to understand the assumptions as
counterparts of the conditions on the error distribution or on the instrumental variable
in the model (LIFU-) of Section 3.3.4. Note that in the present model the
dimension of the IV-space $\mathcal{W}_m$ is increasing in general; therefore the analogy
with the model (LIFU-) with given dimension of IV is to be understood
somewhat loosely.

Condition (C1) represents an asymptotic counterpart of (3.3.105) ((3.3.105)
was not assumed in Section 3.3.4). If $m - \dim \mathcal{W}_m \to \infty$ is satisfied, i.e. under
(C3) (a), then (C1) can be seen as an essential restriction of the possible values
of the parameter $\{\mu_i\}_{i\in\mathbb{N}}$ or of the unknown incidental parameters. Thus (C1)
expresses that additional information is available in the model in the form of the
sequence of IV-spaces $\{\mathcal{W}_m\}_{m \ge m_0}$. In general the case $\mathcal{W}_m = \mathbb{R}^m$, $m \ge m_0$,
is included, in which the sequence of IV-spaces $\{\mathcal{W}_m\}_{m \ge m_0}$ does not provide
additional information. In this case (C1) can be interpreted as a regularity
condition in the sequence model, which is an asymptotic counterpart of
3 G1 San I 

Thus condition (C1) could be understood as a condition of 'asymptotically
nonvanishing correlation' between partially unobserved variables $M_m$ and
instrumental variables $W_m$ (with $\mathcal{W}_m = \mathcal{R}(W_m')$); in case $n(m) = n$, $m \ge m_0$,
this corresponds to the condition (Id) in (LIFU-), (IV) of Section 3.3.4. But
since $\dim \mathcal{W}_m \to \infty$ is admitted, (C1) cannot be construed as a direct
counterpart of (Id); (C1) is weaker than such an assumption.

Under (V$_1$), (V$_2$), condition (C1) suffices for the consistency of the CIVE
of $\mathcal{L}$, since already sufficient information on $\mathcal{P}^{\varepsilon}$ is available (cf. Section 3.3.4
(A)-(C)). Indeed, under (V$_1$), (V$_2$) an instrumental variable is not necessary
for consistent estimation if (C1) is satisfied for $\mathcal{W}_m = \mathbb{R}^m$, $m \ge m_0$ (see above).
But under (V$_3$) condition (C1) is in general no longer sufficient. The
conditions (C3) guaranteeing under (V$_3$) the consistency of the CIVE for $\mathcal{L}$ are
specific for the present model. They represent the counterpart of the condition
(Id) in (LIFU-), (IV) for the present case of generalized nonrandom
instrumental variables ($\{\mathcal{W}_m\}_{m \ge m_0}$, $m^{-1} \dim \mathcal{W}_m \to \alpha \in [0, 1]$). Condition (C3) (a)
excludes $\mathcal{W}_m = \mathbb{R}^m$, $m \ge m_0$; it may be considered as necessary to obtain the
required information on $\Sigma_\varepsilon$. (C3) (b) is a regularity condition which restricts
the 'asymptotic deviation' from the adequate model (with (A)). In the
adequate case these conditions just mean consistent estimability of $\Sigma_\varepsilon$ (cf. Lemma
3.4.3). But in general consistent estimability of $\Sigma_\varepsilon$ is not necessary for the
consistency of $\hat{\mathcal{L}}_{Cm}$.

Compared with (C1), condition (C3) (c) can be considered as the genuine
analogue of the correlation condition (Id) for the case of increasing dimension
of $\mathcal{W}_m$. Indeed, condition (C3) (c) remains satisfied if $\{\mu_i\}_{i\in\mathbb{N}}$ is substituted
by $\{\mu_i + \delta_i\}_{i\in\mathbb{N}}$ for a sequence of independent identically distributed $\mathbb{R}^p$-valued
random variables $\{\delta_i\}_{i\in\mathbb{N}}$. But such a property could be required for
a condition about $\{\mu_i\}_{i\in\mathbb{N}}$ and $\{\mathcal{W}_m\}_{m \ge m_0}$ which would correspond to (Id).

In the case $\alpha = 0$, i.e. with 'essentially finite' dimension of $\mathcal{W}_m$ for $m \to \infty$,
(C3) (c) becomes (C1) and can then be interpreted as above. In the adequate
case, (C3) (c) becomes (C1), since then $H_\Delta = 0_{p\times p}$ holds. Condition (A)
represents a counterpart to a condition of 'full correlation' in the model (LIFU-),
(IV):

Day 


ww 


=='0) 


PXp* 


Under (A) condition (C1) is, as above in the case $\mathcal{W}_m = \mathbb{R}^m$, a regularity
condition in the model.


3.4.3 Examples 


In this section we discuss several special cases of the model introduced in 
Section 3.4.1, as well as the meaning of the results obtained in each case. At 
the same time the motivation for the generality of the model will be clarified, 
as several types of a linear functional relationship, some of which were already 
introduced in the preceding sections (3.1.3, 3.3.1), are included. 


Example 3.4.1 (Explicit case) As the further results in the model (limiting
distribution, asymptotic efficiency) concern the explicit case, we give an
equivalent formulation of the model and of the assumptions, making use of
the simple parametrization possible under (Ex). By virtue of (iv) we see that
$\mathcal{R}(M_m' J^{\perp}) \subseteq \mathcal{W}_m$ is equivalent to $\mathcal{R}(N_{1m}') \subseteq \mathcal{W}_m$. The convergence assumption
in the definition of $\mathcal{M}^0$ is equivalent to $m^{-1} N_m P_{\mathcal{W}_m} N_m' \to \tilde H$. In view of
(viii) the remaining model assumptions can also be expressed in terms of $N_m$.

The formal definition of the model which coincides under (Ex) with the
one introduced in Section 3.4.1 is the following: Let


Re = {Edi lf —liiail, &i€ Ree, & €R-1, i€N, 


aH € MF: A(((Ard)inn,...m)) FS Vn, mS mo, 


and 
Or (PF, Bete); PUPS wB © MGS ey eee Oo os 


Let $\{P_\vartheta \mid \vartheta \in \Theta\}$ be given by

(Zidieon = (Lakition + (Chien ~ Py 

Sihiew ~ (P*)® 

ONSITE S Max w—a ee ae 


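Here $\{\xi_{2i}\}_{i\in\mathbb{N}}$ is a sequence of unknown incidental parameters which satisfies only
a convergence assumption (cf. Remark 3.4.1). The equivalent formulations of
the additional model assumptions (with $N_m = (\xi_i)_{i=1,\dots,m}$) are:


(A) $\mathcal{R}(N_m') \subseteq \mathcal{W}_m$
(C1) $r[\tilde H] = p - q$


(C2) $m^{-1} N_m N_m' \to \tilde H_0$
(C3) (a) $m - n(m) \to \infty$


(b) $(m - n(m))^{-1} N_m P_{\mathcal{W}_m^{\perp}} N_m' \to \tilde H_\Delta$
(c) $r[\tilde H - \alpha \tilde H_\Delta] = p - q$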


(C4) m1N,1_,— ag 
(C5) $\max_{1 \le i \le m} m^{-1} \|N_m P_{\mathcal{W}_m} e_i^{(m)}\|^2 \to 0$.


Example 3.4.2 Let us consider the case of the trivial IV-space $\mathcal{W}_m = \mathbb{R}^m$,
i.e. $n(m) = m$, $m \ge m_0$. Then (A) is satisfied, and under (V$_\nu$), (N), $\nu = 1, 2$,
the CIVE is the MLE. Condition (C3) (a) is not fulfilled; under (V$_3$) the CIVE
is not defined.


Example 3.4.3 Let us consider the case of an IV-space $\mathcal{W}_m$ with fixed
dimension: $n(m) = n$, $m \ge m_0$. The conditions (B$_\nu$) are satisfied for sufficiently
large $m$; hence $\hat B_{Cm}$ exists. Let $W_m \in \mathbb{M}_{n\times m}$, $\mathcal{R}(W_m') = \mathcal{W}_m$, $m \ge m_0$, and under
(Ex) let the following assumptions be fulfilled:


(C6) mW Wo — > Ss Cty 


m m—>co 


(b) m4N,,W,, ss TE MES 


™m m—co 


(C7) max m-! ||W,,e!||? +0. 


1Sism 
If, in addition, (C2) holds, then, as can easily be verified, the conditions (C1),
(C3), (C5) are fulfilled. We then have $G = T S^{-1} T'$.

Condition (C6) (b) represents the asymptotic counterpart of (Id) in the
model (LIFU-), (IV) of Section 3.3.4. Thus $\hat B_{Cm}$ is always consistent. Under
(V$_2$) $\hat\sigma^2_{0m}$ is also consistent; in the adequate case this also holds for $\hat\sigma^2_{Cm}$. Under
(V$_3$), in the adequate case $\hat\Sigma_{Cm}$ and $S_m^{(3)}$ are consistent for $\Sigma_\varepsilon$.


Example 3.4.4 In Example 3.4.3 let $n = p - q$. Then we have
$r[Q_{Z_m \cdot \mathcal{W}_m}] = p - q$, and according to Theorem 3.3.1 (b), (c) one obtains


$\hat{\mathcal{L}}_{Cm} = \mathcal{R}(Q_{Z_m \cdot \mathcal{W}_m}) = \mathcal{R}(Z_m W_m')$.
In view of 7[Qz, .w,-v.,] S7 — 7 (Lemma 3.3.2), Remark 3.3.10 implies 


Boom = 0" R(Qz,,,-Wm Nip)) 
= ¥,,(Pw,, — Pr'yn) Xtm(Xen(Pw,, — Px.) Xam); 


Buca ae VisNin ay BycmXamN tin 


in the notation of (xiv)). This implies that the CIVE and the 2SLS-estimator 
(defined for the adequate case in Section 3.3.1.5) coincide (cf. (3.3.20)). 


Example 3.4.5 (Linear regression model) Let $r = q$. One obtains (as (Ex)
is always satisfied, cf. Section 3.3.4)


Y, = BN, + 8%, 


with known $N_m$. In view of Remark 3.3.6 we obtain from Theorem 3.3.1 (a)
that for any possible choice of the IV-space $\mathcal{W}_m$ the CIVE for $B$, if it exists,
coincides with the BLUE for $B$ (cf. Bunke and Bunke, 1986, Section 2.1):


A 


Bom SS in ZomM in <a YANG: 


since $B = B_1$ (cf. Section 3.3.4 (ii)). For the choice $\mathcal{W}_m = \mathcal{R}(N_m')$ the conditions
reduce to


mN,,Ni, Perera g He (0) 
Hence the linear model is a special case in the sense of Remark 3.3.6. But, 
since the following results are without interest for this case, we will not mention 
it any further. 


Example 3.4.6 In Example 3.4.4 let $p = 3$, $r = 2$, $q = 1$. For $N_{1m} = 1_m'$,
$m \ge m_0$, one obtains a bivariate inhomogeneous explicit model (LIFU*)


$x_i = \xi_{2i} + \varepsilon_{2i}$,
$y_i = B_1 + B_2 \xi_{2i} + \varepsilon_{3i}$, $\quad i = 1, \dots, m$


(for $X_{2m} = (x_i)_{i=1,\dots,m}$, $Y_m = (y_i)_{i=1,\dots,m}$ in (i), (xiv)). Let an IV-matrix
$U_m = (u_i)_{i=1,\dots,m} \in \mathbb{M}_{1\times m}$ with the property $\mathcal{R}(U_m') \ne \mathcal{R}(1_m)$ be given.
Then, with $\mathcal{W}_m = \mathcal{R}([1_m \mid U_m'])$ and


$U_m^* = (u_i^*)_{i=1,\dots,m} := U_m (I_m - P_{1_m})$,


for the CIVE we have according to Example 3.4.4 that


$\hat B_{2Cm} = Y_m P_{U_m^*} X_{2m}' (X_{2m} P_{U_m^*} X_{2m}')^{-1} = \sum_{i=1}^m y_i u_i^* \Big/ \sum_{i=1}^m x_i u_i^*$,


$\hat B_{1Cm} = \bar y_m - \hat B_{2Cm}\, \bar x_m$


($\bar x_m := m^{-1} \sum_{i=1}^m x_i$, $\bar y_m$ analogous). Thus the CIVE is the simple
instrumental variables estimator already introduced in Section 3.3.1.3 (cf. (3.3.9)).
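
The simple instrumental variables estimator can be sketched in a few lines. The code below assumes the textbook form of that estimator (slope equal to the ratio of the sums of $y_i u_i^*$ and $x_i u_i^*$ with a centred instrument $u_i^*$, intercept from the sample means); equation (3.3.9) is not reproduced in this section, so the exact correspondence is an assumption, and the function name `simple_iv` is hypothetical.

```python
def simple_iv(x, y, u):
    """Simple IV estimate (intercept, slope) for y = b1 + b2 * xi with x = xi + error.

    x, y : observed regressor and response values
    u    : instrument values; they are centred before use
    """
    m = len(x)
    if not (len(y) == m and len(u) == m and m > 0):
        raise ValueError("x, y, u must be nonempty and of equal length")
    u_bar = sum(u) / m
    u_star = [ui - u_bar for ui in u]                 # centred instrument
    denom = sum(xi * usi for xi, usi in zip(x, u_star))
    if denom == 0:
        raise ValueError("instrument is uncorrelated with x in this sample")
    slope = sum(yi * usi for yi, usi in zip(y, u_star)) / denom
    intercept = sum(y) / m - slope * sum(x) / m
    return intercept, slope


if __name__ == "__main__":
    # Error-free data on the line y = 1.5 + 2 x: the estimator recovers it exactly.
    x = [0.5, 1.0, 2.5, 4.0, 3.0]
    y = [1.5 + 2.0 * xi for xi in x]
    u = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(simple_iv(x, y, u))
```

Centring the instrument is what makes the intercept drop out of the slope formula: since the centred values sum to zero, any constant added to $y$ leaves the numerator unchanged.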


Example 3.4.7 Let us consider a model with (Ex), $r = p - 1$, $N_{1m} = 1_m'$,
$m \ge m_0$ (inhomogeneous model, cf. Example 3.3.1). Let an IV-matrix
$W_m \in \mathbb{M}_{n(m)\times m}$ be given by


$W_m = \operatorname{Diag}(1_{k(1,m)}', \dots, 1_{k(n(m),m)}')$.


Thus $W_m$ gives a grouping of the observations $z_i$, $i = 1, \dots, m$, into $n(m)$ groups
with $k(i, m)$ elements each ($\sum_{i=1}^{n(m)} k(i, m) = m$). In the adequate case a 'model
with replicated observations' (cf. (3.2.1)) results. More generally, the grouping
is derived from additional information on $(\xi_{2i})_{i=1,\dots,m}$ (e.g. in the case $r - q = 1$
from a known rank statistic for $(\xi_{2i})_{i=1,\dots,m}$ (Ware, 1972)). The condition
$n(m) \ge p - q$ (cf. Remark 3.3.3), i.e. $n \ge r - q + 1$, states that there are
at least as many groups as there are points needed to determine an
$(r - q)$-dimensional affine manifold.
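
To make the role of the grouping matrix concrete, the sketch below builds a block-diagonal matrix of ones of the kind described in this example and projects a data vector onto its row space. Under this construction the projection simply replaces every observation by its group mean. The function names are hypothetical, and the projection routine relies on the rows being mutually orthogonal, which holds for non-overlapping groups.

```python
def grouping_matrix(sizes):
    """Return the n x m matrix Diag(1'_{k_1}, ..., 1'_{k_n}) as nested lists."""
    if any(k <= 0 for k in sizes):
        raise ValueError("group sizes must be positive")
    m = sum(sizes)
    rows, start = [], 0
    for k in sizes:
        rows.append([1.0 if start <= j < start + k else 0.0 for j in range(m)])
        start += k
    return rows


def project_onto_row_space(W, z):
    """Project z onto the row space of W; valid when the rows are orthogonal."""
    out = [0.0] * len(z)
    for row in W:
        coef = sum(r * zj for r, zj in zip(row, z)) / sum(r * r for r in row)
        for j, r in enumerate(row):
            out[j] += coef * r
    return out


if __name__ == "__main__":
    W = grouping_matrix([2, 3])            # two groups of sizes 2 and 3
    z = [1.0, 3.0, 2.0, 4.0, 6.0]
    print(project_onto_row_space(W, z))    # each entry becomes its group mean
```

This is why estimators built on such an IV-space depend on the data only through group means plus within-group residual variation.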

With this example the chosen generality of the model of Section 3.4.1, and
in particular the assumption $\alpha \in [0, 1]$, can be justified more clearly. In
addition to the case of a constant number of groups $n$, the case of an increasing
number of groups $n(m) \to \infty$ is admitted, and with $\alpha > 0$ also the case of a constant
allocation number ($k(i, m) = k$, $i = 1, \dots, n(m)$, $m \ge m_0$). Heuristically stated,
the latter case obviously represents a 'less favourable' or less regular model
with respect to asymptotic theory than the case of fixed $n$, due to the presence
of infinitely many unknown incidental parameters. In situations of real
application the choice of a model with $\alpha > 0$ means that a 'perturbation' of a
model with a finite number of parameters is considered, which means reducing
idealization.


Example 3.4.8 In Example 3.4.7 let n(m) = p — q, i.e. the number of groups 
is a minimal one. According to Example 3.4.4 one obtains 


o(Brcm) ae R(Lom(Pw,, er P,,)) 


== Rl (Zim care Zee ta)
Byom = Y..m — Bec Bm


(Fn = (b(é, m))-2 YS 2, Ili, m) := {ren aa 
— 


jeI(i,m) 


é m 
= Da k(1, my}, oe Ino ys is Wins hm analogous) : 
l=1 


i=1 


) 


Geometrically this means that the affine manifold described by $(o(\hat B_{2Cm}), \hat B_{1Cm})$
(cf. Example 3.3.1) is set just through the $p - q$ group means. For $p = 3$,
$q = 1$ one obtains the grouping estimator of (3.3.7).
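
For the bivariate case with two groups, the geometric description (a line set through the group means) can be sketched as follows. The code assumes the classical two-group form of the grouping estimator; formula (3.3.7) is not reproduced in this section, so the correspondence is an assumption, and the function name is hypothetical.

```python
def grouping_estimator(x, y, labels):
    """Line through the two group means: returns (intercept, slope).

    labels : list of 0/1 group labels, one per observation
    """
    groups = {0: [], 1: []}
    for xi, yi, g in zip(x, y, labels):
        groups[g].append((xi, yi))
    if not groups[0] or not groups[1]:
        raise ValueError("both groups must be nonempty")
    means = {}
    for g, pts in groups.items():
        means[g] = (sum(p[0] for p in pts) / len(pts),
                    sum(p[1] for p in pts) / len(pts))
    dx = means[1][0] - means[0][0]
    if dx == 0:
        raise ValueError("group means of x coincide; slope is undefined")
    slope = (means[1][1] - means[0][1]) / dx
    intercept = means[0][1] - slope * means[0][0]   # line passes through a group mean
    return intercept, slope


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
    y = [0.5 - 1.5 * xi for xi in x]                # error-free data on a line
    print(grouping_estimator(x, y, [0, 0, 0, 1, 1, 1]))
```

With error-free data the two group means lie on the true line, so the sketch returns the true intercept and slope; with errors in both variables the quality of the estimate depends on how the grouping was chosen.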


Example 3.4.9 In the model of Section 3.4.1 let, in addition, a Wpm € @np—¢> 
R( Mim) <— Wem <= Wm» be given. Then one can consider the CIVE Ppm Cor- 
responding to @p, as described in Definition 3.3.1. It always exists for 
sufficiently large m. Let Wm € Mn) ms Wem € Mths qx nim be such that 
Wm = R(W),), Wem = R(W),W p,). According to Example 3.4.4, then 


Lon = R( Zin Win W pm)? 


This yields a class of alternatives to the CIVE $\hat{\mathcal{L}}_{Cm}$ for $\mathcal{W}_m$; admitting certain
random $\mathcal{W}_{Pm}$ with constant $n(m)$, one obtains the class of 'ordinary estimators'
of Villegas (1966) (cf. Section 3.4.6). Applying this procedure in the case of
Example 3.4.7 for $r = 2$, $q = 1$ yields


n(m) n(m) 
°. Ay * = * 
Bopm a Sy Yi.mUim | Li.mUim 
x 


i i= 


for, Ut, = Wh )ierntw € Vikan ATL) = AWS (Py. — Py). This 
means applying a procedure of simple IV-estimation according to Example
3.4.6 after a preliminary data reduction by averaging (cf. also Section 3.5.1). 


3.4.4 Asymptotic normality under normal distribution 


Let us now investigate the limiting distribution of the canonical
instrumental variables estimator for $\mathcal{L}$. We confine ourselves to the explicit case, in which
an asymptotic normal distribution can be proved for the (suitably normalized)
estimator of the $\mathbb{R}^{q\times(p-q)}$-valued parameter $B$. The CIVE $\hat B_{Cm}$ depends on the
observations through $S_m$ and $Q_m$. Here the general form of $Q_m$, in particular the
general form of the IV-space $\mathcal{W}_m$, requires a restriction to normal errors.

As a first step we show the asymptotic normality of the (suitably
normalized) matrix $Q_m$. This result forms the basis for proving asymptotic normality
not only of the CIVE but also of the alternative estimators, to be considered
in the next section.
Lemma 3.4.8 Under (V$_1$), (N) we have


L£{m'¥2(On "t Gn) ae Nx pOpsxps At) 


moo 
i 8 
ae = 2o(L'¢ ®& dt) OF ate 40,(G9° & 2) L ) : 


n(m) 
Proof. Note that 9 = ae Let Py, = >) CimCjm be a spectral decomposition 


‘—1 
of Py. Since H,Q%2 = = G@), we have 
m!2(Qom ie: ae Saas m'2(Qom ae E5Qom) 
n(m) 
ars me yy (ZimCimCimZ m cars EZ mCim@j, alae (9) 
i=1 


Let A“) be the covariance matrix of (9). Since the summands of (9) are inde- 
pendent random variables, 4%) can be calculated using [A 3.19]; one obtains 
AG) ——— Ae), 
mt m—co 
Now it suffices to show that for all K€M,,, the expression m1? 
x tr[K(Qom —Gn)] either almost surely vanishes or is asymptotically 


distributed as N(0, K’ AWK). Here one can restrict oneself to K € My. Let 


D := (Py. + 5)¥? and let 


Pp 
DED = ¥ Addi, 4, € RY, ¢ = 1,::., 0 


i=1 


be a spectral decomposition of DKD. Then it suffices to show that for 
i=, oa eatery 1 


n(m) 
mV? 2 ((d; DZ nCim)® — Ly(d;D~'ZmCim)”) (10) 
— 
either almost surely vanishes or is asymptotically distributed as 
N(0, (d; © d;) D+ A D-*(d; ® d,)) (because of the independence of dd ;,D1Z » 
and djd;D"Z,,, k + 9). Now, if D-*d; € J+, then (10) almost surely vanishes. — 
n(m) 
If D-1d; ¢ J+ holds, then Y) (d;DZyCim)® 18 %nms,,-Aistributed with non- 
centrality parameter i= 


n(m) 
bm = Y (G;D Mn Cim)” 


i=1 
= 4,D"M,Py, MD \d;. 
Hence (10) can be represented as 
n(m) 
m2 > (yf — Byi) + mn + ome)? — BO + Om); 
i=2
where $\{y_i\}_{i\in\mathbb{N}}$ is a sequence of independent $N(0, 1)$-distributed random
variables. Here, in case $\alpha > 0$, the first summand is asymptotically normal; in the
case $\alpha = 0$ it converges to zero in quadratic mean. The second summand is
equal to

$m^{-1/2}(y_1^2 - \mathsf{E}\, y_1^2) + 2(\delta_m m^{-1})^{1/2}\, y_1$


and hence, by virtue of $m^{-1}\delta_m \to d_i' D^{-1} H D^{-1} d_i =: \delta$, it converges in
distribution to $N(0, 4\delta)$. Hence (10) is asymptotically normal, where the limit of
the variance equals the variance of the limit distribution. Since this is also
true for (9), the assertion follows.
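
The variance claim for the noncentral term can be checked by direct moment computation. For a standard normal $y$ and a shift $a = \delta_m^{1/2}$, the variance of $(y + a)^2$ is $2 + 4a^2$, so after division by $m$ it equals $2/m + 4\delta_m/m$, which tends to $4\delta$ when $\delta_m/m \to \delta$. The sketch below evaluates this through the Gaussian moments; all function names are hypothetical and the choice $\delta_m = \delta m$ is made only for illustration.

```python
from math import comb


def std_normal_moment(j):
    """E[y^j] for y ~ N(0, 1): zero for odd j, (j - 1)!! for even j."""
    if j % 2 == 1:
        return 0
    result = 1
    for k in range(j - 1, 0, -2):
        result *= k
    return result


def shifted_moment(k, a):
    """E[(y + a)^k] for y ~ N(0, 1), via the binomial expansion."""
    return sum(comb(k, j) * a ** (k - j) * std_normal_moment(j)
               for j in range(k + 1))


def scaled_variance(m, delta_m):
    """Variance of m^{-1/2} * ((y + sqrt(delta_m))^2 - its expectation)."""
    a = delta_m ** 0.5
    var_square = shifted_moment(4, a) - shifted_moment(2, a) ** 2   # 2 + 4 a^2
    return var_square / m


if __name__ == "__main__":
    delta = 0.5
    for m in (10, 100, 10_000):
        # Approaches 4 * delta = 2.0 as m grows when delta_m = delta * m.
        print(m, scaled_variance(m, delta * m))
```

The vanishing contribution $2/m$ comes from the central part $y_1^2 - \mathsf{E}\,y_1^2$, which is why only the linear term in $y_1$ survives in the limit.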




For the model assumption (V$_3$) one obtains

Lemma 3.4.9 Under (V3), (N), (C3), 

£{(m — n(m))¥? (SY — 2, — Han) oY Want ventas se 
where 

A'$} = 2(2, © Zr) My + 40 (Hs ®@ Zr) Ty. 
Proof. The proof is analogous to the proof of Lemma 3.4.8, if we normalize
with $(m - n(m))^{1/2} = (\dim \mathcal{W}_m^{\perp})^{1/2}$ and observe that $\mathcal{R}(S_m) = \mathcal{J}$ holds almost
surely.
In the following lemma we give a representation of the CIVE $\hat B_{Cm}$ that is
useful for the proof of asymptotic normality as well as of asymptotic efficiency.
Lemma 3.4.10 Under (V$_\nu$), (Ex), (C1), $\nu = 1, 2$, or (V$_3$), (Ex), (C3) we have


Bom =a oY R(QmEm)), m= mM (11) 
for 

Coes nee + Sx) Fe La. (12) 
where 

Ch pi (P31 + Sp) Fa'Ls (13) 

Gag)! = 2;0(B)*. (14) 


Proof. The pose goe (11), (12) ) follows immediately from Theorem 3.3.1 (b) 
if we put C,, = = BomEom and om = = A(Lg,.) = (oe Indeed, by Theorem 
3.3.1 (c) Qom can be replaced by Q,, in (b) (for A = n(m) m-(63,, — o?) under 
(V.)). The convergence (13) results from Lemma 3.4.5 (b), Lemma 3.4.3, and 
Theorem 3.4.1. in conjunction with Remark 3.4.2. Taking into account 
Ey € Mex» by Lemma 3.4.5 (ce), we obtain 
ROys)* = (Fg (Pax + 8p) Fs") o(B)! 

= P,(Pz. + So) Fxo(B)t = A PePs.F gL} + SoL4) 

= R(FgP5iJL}, + 2L4) = Z:0(B)* 
in view of [A 1.6a)]. 


Now we need a convergence statement on $\hat B_{Cm}$.
Lemma 3.4.11 Under (V$_1$), (N), (Ex), (C1) we have
$\hat B_{Cm} - B = O_P(m^{-1/2})$.


Proof. Let us consider the representation of Bom according to Lemma 3.4.10, 
which may be written as 


RL...) = R(OnEm)s 


or, since R(O,,Cn) € OM,xip—q) is equivalent to [LiOnEn) = p—q by 
[A 1.3] and part (b) of [A 1.6], as 


Lt... = QnEn\LQmEn)?- 
Consequently 
mL Lg = m'2(Bom — B) 
= 0}? OnCn(LoQmEn)? 
= mL! (Qn — LeGnl's) En(LOQmEm)*- 
The assertion now follows from Lemma 3.4.8 and from 
En(LQmE mn)? > Eoo(L 600)? G1 
according to Lemma 3.4.7 and Lemma 3.4.10, where L',Cool =p —q isa 


consequence of (14). 


With this we can now show asymptotic normality of m1'2(Q@) — L,G\?L;). 
Let 
Ry, := m8z,.w,, —(m — n(m)) m2, + ony (15) 


Lemma 3.4.12 Under (V,), (N), (C2), » = 1, 2, 3, we have 


L{m?Ry} meer VK Ue Ae) 


where 


~ 


Mo, = 2(1 — x) (Zp @ Zt) Pp + 40 ((Ho — H) © 2) Lp. 


Proof. For m = n(m) 
J'Sz,.w,J ~ Wm — nlm), 2S, Py+M mn) 


and 
mM *MomPy+Mom = (m —n(m)) mA am = Ayn — Am a? Ay —Hi 


holds by (C2). For $n(m) = m$, $R_m$ vanishes. With [A 2.11] the assertion follows,
where $N_{p\times p}(0_{p\times p}, 0_{p^2\times p^2})$ is understood as a one-point distribution.


Lemma 3.4.13 Under $(V_2)$, (N), (Ex), (C1), (C2), we have


$$\mathcal L\{m^{1/2}(Q_m^{(2)} - L_BG_m^{(2)}L_B')\} \xrightarrow[m\to\infty]{} N_{p\times p}(0_{p\times p}, \Lambda^{(2)}_\vartheta)$$
where 
A? = 2o(Z; @ ZX) Ip — 2x2q7((Z; @ Zr) JT Se Te x;)) 
+ Qo®q 322, + 40 y(LgG Ls @ 2) I. 
Proof. First, by Remark 3.3.7, $\hat B^{(2)}_{Cm} = \hat B^{(1)}_{Cm}$ holds for sufficiently large $m$,
if $\hat B^{(1)}_{Cm}$ is the CIVE for $V_1 = \{\sigma^2\Sigma\}$ and $\sigma^2$ the true parameter. Furthermore,
we have


O® — L,G°L, = QM — L,GYL', — n(m) m-\(62_ — 02) Z 


because of GO) = G® + H,,. The form of 6%, according to Theorem 3.4.3 for 
Lom = o(BY,) = o(BY wy )) in H,, implies 


Gtm — 0 = gt tr [MD n(mQz,, — Xr) 
= 73 tr [En Om] + 7? tr [IT (m8 7,27, 


— (m — n(m)) m+ z:)] ; 
Now . 
m2 tr [Tn LpH L's] 


= tr [mi(BYy — BY) (LY 21+) (BY), — B) Hy] 


for — Ligay- By Lemma 3.4.11 and Theorem 3.4.1 the above expression is 
op(1). Similarly, since 


(m — n(m)) m+ Han = Hom — Hy, 
(see also (v), (viii), Section 3.4.1) and since (C 2) is assumed, 
tr [IZ,,(m — n(m)) m-? LgH 4nL',| = op(m-1?). 
Hence, with R,, from (15) 
QD — LGD Ly = (Lp — n(mm) (mg) E1Y,) (QD — LyGOL5) 
(mm) (mq)* ELT gn + op(m-2) 
= (Ly. — nm) (mq) Z-17') (QD — L,09E5) 


4. n(m) (mg)? ET'Ry, + op(m-"?) (16) 
in view of Lemmas 3.4.6 (a), 3.4.8, and 3.4.12. The independence of $Q^{(1)}_m$
and $R_m$ implies the limit distribution for the above expression; we calculate


ag SIT’ AY = 202g! BIT (Ze @ Et) + 40q-? E-DLILZGOL’, 
= 2aqt EIT’ (Z; @ Ei) 
xg? S Il’ AVITE, = 2odq?2E,E% tr UIE IE] = 2u8q 3B, 
og? IT’ Ayers! = 2(1 — «) og 3d. 3, 
and thus we obtain $\Lambda^{(2)}_\vartheta$. ∎

From Lemmas 3.4.8 and 3.4.9 we directly infer the result for $Q^{(3)}_m$.

Lemma 3.4.14 Under $(V_3)$, (N), (C3), we have


L£{(m wi n(m))? (Q°) ar Go) Tso Nox p(Opxp: A$”) 


where 
A?) — (1 pt «) AY as «2A3) 


= 2a(Z, © 2) Ip + 40, (GO © ZX) L; 


With these results we are now able to establish the asymptotic normality
of the CIVE $\hat B_{Cm}$.

Theorem 3.4.5 Under $(V_\nu)$, (N), (Ex), (C1), $\nu = 1, 2$, or $(V_3)$, (N), (Ex),
(C3), we have

$$\mathcal L\{(\tilde n(m))^{1/2}(\hat B_{Cm} - B)\} \xrightarrow[m\to\infty]{} N_{q\times(p-q)}(0_{q\times(p-q)}, D_{0\vartheta}),$$
where 
Dog = «G-L5(Z; + Lgl yg) Lp) G1 © Ly’ 2c Lz 
+ G\(Gy — oly) G1 @ Lt 2,Lt. 
Proof. Let

$$C^* := C_{0\vartheta}(L_B'C_{0\vartheta})^{-1}$$

for $C_{0\vartheta}$ from Lemma 3.4.10. As in Lemma 3.4.11 we use Lemma 3.4.10 to
obtain the representation

$$(\tilde n(m))^{1/2}(\hat B_{Cm} - B) = L_B^{\perp\prime}\,(\tilde n(m))^{1/2}(Q_m - L_BG_mL_B')\,\hat C_m(L_0'Q_m\hat C_m)^{-1}. \tag{17}$$

From $\hat C_m(L_0'Q_m\hat C_m)^{-1} \to C^*G^{-1}$ and from Lemmas 3.4.8, 3.4.14, and Remark
3.3.7 the assertion of the theorem follows for

$$D_{0\vartheta} = (G^{-1}C^{*\prime} \otimes L_B^{\perp\prime})\,\Lambda_\vartheta\,(C^*G^{-1} \otimes L_B^{\perp}). \qquad ∎$$

The starting point is Lemma 3.4.7. It states that the matrix $Q_m$ has a singular
probability limit:

$$Q_m \xrightarrow[m\to\infty]{} L_BGL_B' \in \mathfrak M_p, \qquad r[G] = p - q, \tag{18}$$

which by $B = \varrho^{-1}(R(L_BGL_B'))$ identifies the unknown parameter $B$. Now it
can be shown that $r[Q_m] = p$ holds for $m \ge m_0$.


We consider the problem of consistent estimation of $B$ on the basis of (18).
Let

$$f: \mathbb R^{p(p+1)/2} \to \mathbb R^{q(p-q)}$$

be a function with property

$$f(\Gamma_p'L_BGL_B') = B, \qquad \forall (B, G) \in \mathfrak M_{q\times(p-q)} \times \mathfrak M_{p-q},\ r[G] = p - q. \tag{19}$$

Thus the function $f$ applied to the $p(p+1)/2$ different elements of a symmetric
matrix $Q$ (i.e. to $\Gamma_p'Q$, cf. [A 1.10]) always yields $B$ if $R(Q) = \varrho(B)$. If $f$ is
continuous in an open neighbourhood $A^*$ of

$$A = \{\Gamma_p'L_BGL_B' \mid B \in \mathfrak M_{q\times(p-q)},\ G \in \mathfrak M_{p-q},\ r[G] = p - q\},$$

then an estimator $\hat B_m$ of $B$ is obtained in a natural way:

$$\hat B_m = f(\Gamma_p'Q_m). \tag{20}$$

Because of (18) and the continuity of $f$ in an open neighbourhood of $\Gamma_p'L_BGL_B'$, $\hat B_m$
is consistent. If $f$ is additionally assumed continuously differentiable on
$A^*$, then the asymptotic normality of $(\tilde n(m))^{1/2}(Q_m - L_BG_mL_B')$ and [A 2.10]
can be used to derive the relation

$$\hat B_m - B = df(\Gamma_p'L_BGL_B')\,\Gamma_p'(Q_m - L_BG_mL_B') + o_P((\tilde n(m))^{-1/2}), \tag{21}$$


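where $df$ denotes the total derivative of $f$. Now, differentiation of (19) with
respect to $B$ yields

$$df(\Gamma_p'L_BGL_B')\,\partial_B\Gamma_p'L_BGL_B' = I_{q(p-q)}, \tag{22}$$

and differentiation with respect to $g$, if we put $g := \Gamma_{p-q}'G \in \mathbb R^{(p-q)(p-q+1)/2}$,
yields

$$df(\Gamma_p'L_BGL_B')\,\partial_g\Gamma_p'L_BGL_B' = 0_{q(p-q)\times(p-q)(p-q+1)/2}. \tag{23}$$

Using a perturbation expansion, one easily calculates $\partial_B\Gamma_p'L_BGL_B'$ and $\partial_g\Gamma_p'L_BGL_B'$.
For this purpose one puts $B_\lambda = B + \lambda B_1$, $g_\lambda = g + \lambda g_1$, and $G_\lambda = G + \lambda G_1$
(for $\Gamma_{p-q}'G_\lambda = g_\lambda$, $\Gamma_{p-q}'G_1 = g_1$) and by taking into account (3.3.89) one gets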
L3,G,L3, = LpGLy + AL,GBiLA' + LB,GL’, + L,G,L';) + O(22) and
hence 


Og LeGL, = Pil(Lp@ @ Ly) + (Ld © LeG) Lip gas] 
= 2f(Ln6 ® Le), 
Of LnGL, = f(Ly @ Ls) I',-g 
Inserting these expressions into (22) and (23) yields the linear constraints on
$df(\Gamma_p'L_BGL_B')$.
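The perturbation argument can be checked numerically in the smallest case. The following is a minimal sketch, assuming $p = 2$, $q = 1$, the block convention $L_B = (1, B)'$ and a scalar $G$; these conventions and all function names are illustrative assumptions, not taken from the text. It verifies that, after subtracting the term that is linear in $\lambda$, the remainder of $L_{B_\lambda}G_\lambda L_{B_\lambda}'$ shrinks like $\lambda^2$.

```python
# Sketch under stated assumptions: p = 2, q = 1, L_B = (1, B)', scalar G.

def outer(u, v, scale=1.0):
    """scale * u v' for two vectors given as lists."""
    return [[scale * a * b for b in v] for a in u]

def add(*mats):
    """Entrywise sum of 2 x 2 matrices."""
    return [[sum(m[i][j] for m in mats) for j in range(2)] for i in range(2)]

def limit_matrix(B, G):
    """L_B G L_B' with L_B = (1, B)'."""
    L = [1.0, B]
    return outer(L, L, G)

def first_order_term(B, G, B1, G1):
    """Coefficient of lambda in the expansion of L_{B+lam*B1} (G+lam*G1) L'."""
    L, e2 = [1.0, B], [0.0, 1.0]   # e2 selects the B-block of L_B
    return add(outer(L, e2, G * B1), outer(e2, L, G * B1), outer(L, L, G1))

def max_remainder(lam, B=0.7, G=2.0, B1=-1.3, G1=0.4):
    """Largest entry of exact - base - lam * (first-order term)."""
    exact = limit_matrix(B + lam * B1, G + lam * G1)
    base = limit_matrix(B, G)
    lin = first_order_term(B, G, B1, G1)
    return max(abs(exact[i][j] - base[i][j] - lam * lin[i][j])
               for i in range(2) for j in range(2))

# Shrinking lambda by a factor 10 should shrink the remainder by about 100.
ratio = max_remainder(1e-2) / max_remainder(1e-3)
```

A ratio close to 100 is what a remainder of order $\lambda^2$ predicts; the first-order term is the one whose vectorized form gives the two blocks of derivatives used in (22) and (23).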


Now relation (21) in conjunction with (22) and (23) suggests a class of estimators
which contains those of the form (20) as a special case. The definition
again relates to the general model according to Section 3.4.1.

Definition 3.4.1 Let the model $\{P_\vartheta \mid \vartheta \in \Theta\}$ be described by $(V_\nu)$, $\nu = 1, 2, 3$,
(Ex), and further assumptions entailing (18). An estimator $\{\hat B_m\}_{m\ge m_0}$ of the
structural parameter $B$ is called an asymptotic $Q_m$-estimator if there exists a
mapping $\varphi: \Theta \to \mathfrak M_{q(p-q)\times p(p+1)/2}$, $\varphi(\vartheta) = \mathcal C_\vartheta$, so that

(a) $\hat B_m - B = \mathcal C_\vartheta\Gamma_p'(Q_m - L_BG_mL_B') + o_P((\tilde n(m))^{-1/2})$, $\vartheta \in \Theta$,
(b) $\mathcal C_\vartheta A_\vartheta = J_0$, $\vartheta \in \Theta$,

for

$$J_0 := (I_{q(p-q)} \mid 0_{q(p-q)\times(p-q)(p-q+1)/2}),$$
Ay := (26 ,(L2@4 @ Ld) iP (Le @ Lz) P,.). 
First we show that the matrix $A_\vartheta$ of the linear constraints on $\mathcal C_\vartheta$ has full rank.


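Lemma 3.4.15 Under $(V_\nu)$, (C1), $\nu = 1, 2$, or $(V_3)$, (C3), we have

$$A_\vartheta \in \mathfrak M^{(p-q)(p-q+1)/2+q(p-q)}_{p(p+1)/2\,\times\,[(p-q)(p-q+1)/2+q(p-q)]},$$

i.e. $A_\vartheta$ has full column rank.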


Proof. From the above calculation of $A_\vartheta$ using perturbation expansion we
see that $r[A_\vartheta] < (p - q)(p - q + 1)/2 + q(p - q)$ implies that there are

$$B^* \in \mathfrak M_{q\times(p-q)}, \qquad G^* \in \mathfrak M_{p-q}, \qquad [B^*, G^*] \ne 0,$$

so that

L,GBY LM + LABRGL', + LyQ*L'y = Opyp- 
Premultiplying by $L_B^{\perp\prime}$ yields

$$B^*GL_B' = 0_{q\times p},$$

which implies $B^* = 0_{q\times(p-q)}$, $G^* = 0_{(p-q)\times(p-q)}$. This is a contradiction. ∎

Now it turns out that the CIVE $\{\hat B_{Cm}\}_{m\ge m_0}$ is an element of this class of
estimators.


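Theorem 3.4.6 If $(V_1)$, (N), (Ex), (C1); or $(V_2)$, (N), (Ex), (C1), (C2); or
$(V_3)$, (N), (Ex), (C3) are satisfied, the CIVE $\{\hat B_{Cm}\}_{m\ge m_0}$ is an asymptotic
$Q_m$-estimator. Here we have to put $\mathcal C_\vartheta = \mathcal C_{0\vartheta}$, $\mathcal C_{0\vartheta}$ being defined by

$$\mathcal C_{0\vartheta} := \bigl(G^{-1}(C_{0\vartheta}'L_B)^{-1}C_{0\vartheta}' \otimes L_B^{\perp\prime}\bigr)\,\Gamma_p$$

with $C_{0\vartheta}$ from Lemma 3.4.10.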
Proof. As in the proof of Theorem 3.4.5, one obtains the representation (3.4.17)
and from this

$$\hat B_{Cm} - B = \bigl((\hat C_m'Q_mL_0)^{-1}\hat C_m' \otimes L_B^{\perp\prime}\bigr)\,\Gamma_p\Gamma_p'(Q_m - L_BG_mL_B').$$

Now the property (a) of Definition 3.4.1 follows from

$$\bigl((\hat C_m'Q_mL_0)^{-1}\hat C_m' \otimes L_B^{\perp\prime}\bigr)\,\Gamma_p - \mathcal C_{0\vartheta} = o_P(1)$$

and from Lemmas 3.4.8, 3.4.13, and 3.4.14. To prove the property (b) we
calculate


Cool’ y(LaG @ Lt) = (G-"(OopLg) 3 Cog © Lh’) (IgG © Lt) 
+ (G(GooL)-! Ooo @ L5') (Lé @ LpG) Ip-a.a} 


= Oq(p-9) x (p-g)(p-a41)/2> 


Remark 3.4.5 Theorem 3.4.6 illustrates that the class of asymptotic $Q_m$-estimators
is not restricted to estimators based on the construction principle
described in the beginning. Indeed, Theorem 3.3.1 shows that $\hat B_{Cm}$ is not only
a function of $Q_m$ (compare also Remark 3.3.8). Further examples of asymptotic
$Q_m$-estimators will be discussed at the end of this section. The limit
distribution statement for asymptotic $Q_m$-estimators is now a simple consequence
of the definition and the asymptotic normality shown in Section 3.3.4
of $(\tilde n(m))^{1/2}(Q_m - L_BG_mL_B')$.


Theorem 3.4.7 Under $(V_1)$, (N), (Ex), (C1); or $(V_2)$, (N), (Ex), (C1),
(C2); or $(V_3)$, (N), (Ex), (C3) we have for an asymptotic $Q_m$-estimator $\{\hat B_m\}$
satisfying Definition 3.4.1 for $\mathcal C_\vartheta$ that

$$\mathcal L\{(\tilde n(m))^{1/2}(\hat B_m - B)\} \xrightarrow[m\to\infty]{} N_{q\times(p-q)}\bigl(0_{q\times(p-q)}, D_\vartheta(\{\hat B_m\}_{m\ge m_0})\bigr),$$

$$D_\vartheta(\{\hat B_m\}_{m\ge m_0}) = \mathcal C_\vartheta\Gamma_p'\Lambda_\vartheta\Gamma_p\mathcal C_\vartheta' \tag{24}$$

with $\Lambda_\vartheta$ from Lemmas 3.4.8, 3.4.13, and 3.4.14.

The notation $D_\vartheta(\{\hat B_m\}_{m\ge m_0})$ shall henceforth be used for the covariance matrix
of the limit distribution of $(\tilde n(m))^{1/2}(\hat B_m - B)$. The following discussion of
asymptotic efficiency relates to optimality with respect to $D_\vartheta(\{\hat B_m\}_{m\ge m_0})$.


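Definition 3.4.2 Let the model $\{P_\vartheta \mid \vartheta \in \Theta\}$ be given by $(V_\nu)$, $\nu = 1, 2, 3$, (Ex),
and further assumptions which entail (18) and (24). Then an asymptotic
$Q_m$-estimator $\{\hat B_m\}_{m\ge m_0}$ is said to be asymptotically efficient if for each asymptotic
$Q_m$-estimator $\{\hat B^*_m\}_{m\ge m_0}$

$$D_\vartheta(\{\hat B_m\}_{m\ge m_0}) \le D_\vartheta(\{\hat B^*_m\}_{m\ge m_0}), \qquad \vartheta \in \Theta.$$

In the treatment of asymptotic efficiency we distinguish now between the
cases $\alpha = 0$ and $\alpha > 0$.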


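Theorem 3.4.8 Under $(V_1)$, (N), (Ex), (C1); or $(V_2)$, (N), (Ex), (C1),
(C2); or $(V_3)$, (N), (Ex), (C3), in the case of $\alpha = 0$ for arbitrary asymptotic
$Q_m$-estimators $\{\hat B_m\}_{m\ge m_0}$, $\{\hat B^*_m\}_{m\ge m_0}$ we have

$$\hat B_m - \hat B^*_m = o_P((\tilde n(m))^{-1/2}).$$

Proof. Let $\{\hat B_m\}_{m\ge m_0}$ and $\{\hat B^*_m\}_{m\ge m_0}$ be estimators satisfying Definition 3.4.1
for $\mathcal C_\vartheta$ and $\mathcal C^*_\vartheta$, respectively. Then
$\hat B_m - \hat B^*_m = (\mathcal C_\vartheta - \mathcal C^*_\vartheta)\Gamma_p'(Q_m - L_BG_mL_B') + o_P((\tilde n(m))^{-1/2})$.
Hence it suffices to show that

$$(\mathcal C_\vartheta - \mathcal C^*_\vartheta)\,\Gamma_p'\Lambda_\vartheta\Gamma_p = 0_{q(p-q)\times p(p+1)/2}.$$

Since $(\mathcal C_\vartheta - \mathcal C^*_\vartheta)A_\vartheta = 0$ by property (b), for this it suffices to prove

$$R(\Gamma_p'\Lambda_\vartheta\Gamma_p) \subseteq R(A_\vartheta). \tag{25}$$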
Now 
Ay = 40, (LeQoL'g @ 2) Tp 


by Lemmas 3.4.8, 3.4.13, and 3.4.14, and 
PAL, = Fi (Lz @ Ip) (Gols © Xr) Py, 
RAs) = ALP, (Ly @ Lj)) + RL ,(Le @ Ls) Py) 
= KPi(Ly ® Lg) + AL; (Le @ Ls) 
= Pi(Ly @ Ip) (RLp-¢ @ Lg) + HZp-q @ Ln))- 
To prove (25) it is sufficient to show that 
KM (Lz ® Ex) Fy) S RI p-q ® Lg) + Ap @ Le); 
and this is satisfied because of 
HRI p-q @ Lt) + RIp-q © Lp) = AIp-q © (Ly 1 Ls)) = RPP. 
With Theorems 3.4.5, 3.4.6, and 3.4.8, and Remark 3.4.3 (a) we obtain

Corollary 3.4.1 Under the assumptions of Theorem 3.4.8 for each asymptotic
$Q_m$-estimator $\{\hat B_m\}_{m\ge m_0}$ we have

$$D_\vartheta(\{\hat B_m\}_{m\ge m_0}) = G^{-1}G_0G^{-1} \otimes L_B^{\perp\prime}\Sigma_\varepsilon L_B^{\perp}.$$

Let us now consider the case $\alpha > 0$. Here the asymptotic $Q_m$-estimators are
not asymptotically equivalent in general (in the sense of Theorem 3.4.8); then
the structure of the limit covariance matrix in conjunction with the constraint
(b) from Definition 3.4.1 allows an optimality statement.


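Theorem 3.4.9 Let the model $\{P_\vartheta \mid \vartheta \in \Theta\}$ be given by $(V_1)$, (N), (Ex), (C1);
or $(V_2)$, (N), (Ex), (C1), (C2); or $(V_3)$, (N), (Ex), (C3), and let $\alpha > 0$. Let
$\{\hat B_m\}_{m\ge m_0}$ be an asymptotic $Q_m$-estimator satisfying Definition 3.4.1 for $\mathcal C_\vartheta$. Then:

(a) If $\mathcal C_\vartheta = \mathcal C_{0\vartheta}$, $\vartheta \in \Theta$ ($\mathcal C_{0\vartheta}$ from Theorem 3.4.6), then $\{\hat B_m\}_{m\ge m_0}$ is asymptotically
efficient.

(b) $\mathcal C_\vartheta = \mathcal C_{0\vartheta}$, $\vartheta \in \Theta$, is necessary for the asymptotic efficiency of $\{\hat B_m\}_{m\ge m_0}$ under
$(V_1)$, $(V_3)$ and under $(V_2)$, $\alpha < 1$.

(c) If $\{\hat B_m\}_{m\ge m_0}$ is asymptotically efficient, then $D_\vartheta(\{\hat B_m\}_{m\ge m_0}) = D_{0\vartheta}$ ($D_{0\vartheta}$ from
Theorem 3.4.5).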

Proof. For fixed $\vartheta \in \Theta$ we consider the problem of minimizing $\mathcal C\Gamma_p'\Lambda_\vartheta\Gamma_p\mathcal C'$,
$\mathcal C \in \mathfrak M_{q(p-q)\times p(p+1)/2}$, under the constraint $\mathcal CA_\vartheta = J_0$. The assertion (a) is proved
if $\mathcal C_{0\vartheta}$ is the solution of this problem for all $\vartheta \in \Theta$ (Theorem 3.4.7, Definition
3.4.2). Then (c) follows from Theorems 3.4.7, 3.4.6, and 3.4.5. Assertion (b)
is proved if under the given assumptions the solution is uniquely determined.

From $\mathcal CA_\vartheta = J_0$ we obtain


CEUs) La) to aes One neon aie (26) 
which implies 
Omens Ol, p(Lg @ Lg) Pp-q( AL, © AL) 
for each A € M,-. Let S = E; + LyL',; then & > 0 and 
CP Al C= CP (Ay + 2a0lgh, @L,0,) PC’ 
= OF [Ay + 20LgLy @ Lg Ly + 405 (oI pg — Gy) Ly © Zr] f',C' 
+ 4CP [Lz(Go — «I pq) Lz @ SX) P,C'. (27) 


Let &®) := 2a?q-'I)(v) for » = 1, 2, 3; write & in the sequel. Using Lemmas 
3.4.8, 3.4.13, and 3.4.14, we now show that 


P45 + 2oLgLy @ Lal + 4Lp(oLy-q — Go) Ly @ Xe), 
= 2h (£@ £) fF, — af (EX @ 2) + (2 @ 2) TE) F 
+a, 25 Ly, (28) 
where the relations $\Gamma_p'\Gamma_p = I_{p(p+1)/2}$, $2\Gamma_p\Gamma_p' = I_{p^2} + I_{(p,p)}$ and the properties of $I_{(p,p)}$
([A 1.10]) are exploited. Analogously to the proof of Theorem 3.4.8 and Corollary
3.4.1 it can be shown that


409, (Lp(Go — oD pq) Li, © E;) £0" 
= GG) — oly) G-? @ LM E,L4. (29) 
From (27), (28), and (29) we obtain 
CEA 
= OF [2aF @ F — a( LIT (SZ, @ 21) + (2, @ EH) WE) 
+ 62,27) 0,0' + GG) — oly_q) G-* @ LY E,L4. (30) 
Furthermore, it follows from (26) that 


Cr, Lpl's = Oq(p-a) x1» 
CES, =0Ts (31) 


Cf (E @ E) TT = CF (E12 @ E12) Pynys 
= OF (S02 @ £12) (I, — Pgang) 
eo Of (ls @ Ln (bes 
Ope. (32) 


Observing that (2; ® 2) ie ( @ £) IT one thus obtains from (30), (31), and 
(32) that 
CP Al’ ,C' — GG — oly») G-! @ Lt ZL} 
= Of [205 @ E — a(FSI(E @ £) + (EF @ LZ) ME) + EE F,0' 
= GAC! 
for : pas ie 
A := 2f[S @ F — &(2x)3 (LF @ 2) MINS @ Z)L>. 
Note that 
A me 2af', (S12 @ U2) 
x ia &(2cx)-2 (S12 @ Suey [TTT (S32 @ FV2)] (S2 @ SU2) fe 
= 20h" (EU? @ LU2) (Ip — g&(2a)-* M1] (S12 @ Eu) F,, 
where
Tf := q-US22 @ S12) TIT (E12 @ Fz) 
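is a projection matrix, as can easily be verified. With $(V_\nu)$, $\nu = 1, 3$, or $\alpha < 1$
we have $\Lambda > 0$. For

$$\mathcal C_\Lambda := \mathcal C\Lambda^{1/2}, \qquad A_\Lambda := \Lambda^{-1/2}A_\vartheta,$$

$\mathcal C_\Lambda\mathcal C_\Lambda'$ is to be minimized under the constraint $\mathcal C_\Lambda A_\Lambda = J_0$. A sufficient
condition for minimization is

$$R(\mathcal C_\Lambda') \subseteq R(A_\Lambda), \qquad \mathcal C_\Lambda A_\Lambda = J_0,$$

or

$$R(\Lambda\mathcal C') \subseteq R(A_\vartheta), \qquad \mathcal CA_\vartheta = J_0. \tag{33}$$

Because of Lemma 3.4.15 and $\Lambda > 0$ the solution in $\mathcal C$ is uniquely determined.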
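Why a range condition of the form (33) is sufficient can be seen from a short standard argument of Gauss-Markov type. The following derivation is a sketch under the assumptions that $\Lambda$ is symmetric positive definite and that $A_\vartheta$ has full column rank; the competitor $\mathcal D$ and the matrix $K$ are names introduced here for illustration only.

$$\begin{aligned}
&\text{Let } \mathcal CA_\vartheta = J_0 \text{ and } \Lambda\mathcal C' = A_\vartheta K \text{ for some matrix } K, \text{ and let } \mathcal D \text{ be any matrix with } \mathcal DA_\vartheta = J_0. \\
&\text{Then } (\mathcal D - \mathcal C)A_\vartheta = 0, \text{ hence } (\mathcal D - \mathcal C)\Lambda\mathcal C' = (\mathcal D - \mathcal C)A_\vartheta K = 0, \text{ and} \\
&\mathcal D\Lambda\mathcal D' = \mathcal C\Lambda\mathcal C' + (\mathcal D - \mathcal C)\Lambda(\mathcal D - \mathcal C)' \ \ge\ \mathcal C\Lambda\mathcal C', \\
&\text{with equality only if } (\mathcal D - \mathcal C)\Lambda^{1/2} = 0, \text{ i.e. } \mathcal D = \mathcal C. \\
&\text{An explicit matrix satisfying both conditions is } \mathcal C = J_0(A_\vartheta'\Lambda^{-1}A_\vartheta)^{-1}A_\vartheta'\Lambda^{-1}, \\
&\text{since } \mathcal CA_\vartheta = J_0 \text{ and } \Lambda\mathcal C' = A_\vartheta(A_\vartheta'\Lambda^{-1}A_\vartheta)^{-1}J_0' \in R(A_\vartheta).
\end{aligned}$$

This is the same algebra as for the best linear unbiased estimator in a linear model with design matrix $A_\vartheta$ and error covariance $\Lambda$.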
Under (V2), « = 1, we have g&(2x)"* = 1. Then let I7* € M,»,(p»4) be such 
that JT*II*’ = I,, — II. We have 


M(Z-¥2 @ T-U2) PA» 

= M(S-? @ X-1”) P, (Lp @ Ip) (26 @ Ly} (pq @ Ln) F5-4) 

34 Ope x [(p—a)(p-g+1)/2+9(P—9)] (34) 
Hence the constraint CAs = Jo can be written as 

OPEN? @ 212) (Le 1) (2-2 @ 2-42) FAs =o. 
Defining 

Cay = OF (E12 @ 51) 17 

Aye = 11? (2-8 @ 2-1) TAs, 


we see that C,,C,, is to be minimized under the constraint Cy, Ay. = Jo. 
A sufficient condition for minimization is 


KC.) S RAxy), CxrAnx = Jo 
or equivalently 
HIT*C i.) S RIT*A y), Cx Aan = So- 


For this (33) is sufficient because of rfl = aap and (34). 
To prove the theorem it now suffices to show that $\mathcal C_{0\vartheta}$ fulfils (33). For $C_{0\vartheta}$
from Lemma 3.4.10 we have
RAC) = RAT, (Eos @ L4)) 
= AP (SU? @ LM) (Ips — qo(2x)-? M1) (S169 @ S12] )). 
Now 


T(E2?Oyg @ ZV2L4) = g- S12 @ E12) LA’ EEC’, = 0 


p? x q(p—-q) 
because of 2(Gi,) = RLZ,LR) = R(SL4) according to Lemma 3.4.10. Hence 
RMA Ops) = RT (ZCoo © ZL4)) 
= AP; (Lz © Z4L)). 


The proof of (33) now proceeds as in Theorem 3.4.8. By Theorem 3.4.6,
$\mathcal C_{0\vartheta}A_\vartheta = J_0$. ∎

With Theorem 3.4.6 we immediately obtain the following result.

Corollary 3.4.2 Under the assumptions of Theorem 3.4.9 the CIVE $\{\hat B_{Cm}\}_{m\ge m_0}$
is asymptotically efficient.

This is a useful optimality result since the most important alternatives to the
CIVE are also asymptotic $Q_m$-estimators.


Example 3.4.10 (Modified two-stage-LSE (2SLS)-estimators and related ones) 
In Section 3.3.1.4 in a special model we already mentioned the estimators 
(3.3.18) and (3.3.19), which are alternatives to the CIVE. Here we consider 
the generalizations of these estimators for the present model; it turns out that 
these can be construed as analogues of the 2SLS-estimators in simultaneous 
equation models, whereas the CIVE corresponds to the LIML (cf. Sections 
3.1.3, 3.3.1.5). 

First we consider the case of $r = p$. For the function $f$ in the initial
construction we set

$$f(\Gamma_p'A) = \begin{cases} \varrho^{-1}(R(AL_0)) & \text{if } R(AL_0) \in \varrho(\mathfrak M_{q\times(p-q)}), \\ \text{arbitrary} & \text{otherwise.} \end{cases} \tag{35}$$

Since $R(L_BGL_B'L_0) = R(L_BG) = \varrho(B)$ if $r[G] = p - q$, this function fulfils
(19). Moreover, $f$ is continuously differentiable on an open neighbourhood
$A^*$ of $A$; indeed, $f$ may be represented as
ee LgAL(Lp Aly)? if LAL] = p — 9 
ina-{% 
arbitrary otherwise 


Now, as

$$A^* = \{\Gamma_p'A,\ A \in \mathfrak M_p \mid r[L_0'AL_0] = p - q\}$$

is open, $f$ is continuously differentiable there, and $A \subset A^*$, the assertion
follows. If we put

$$\hat B_{(1)m} = f(\Gamma_p'Q_m),$$

then the estimator $\hat B_{(1)m}$ is an asymptotic $Q_m$-estimator under the assumptions of
Theorem 3.4.6.
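The construction for $r = p$ can be tried out on an exact limit matrix. The following is a minimal sketch, assuming $p = 3$, $q = 1$, the block convention $L_B = (I_{p-q}; B)$ and $L_0$ equal to the first $p - q$ columns of $I_p$; these conventions and all function names are assumptions made for the illustration. It forms $A = L_BGL_B'$ and recovers $B$ as the lower block of $AL_0(L_0'AL_0)^{-1}$.

```python
# Sketch under stated assumptions: p = 3, q = 1, L_B = (I_2; B), L_0 = first
# two columns of I_3. The 2 x 2 inverse limits this helper to p - q = 2.

def matmul(a, b):
    """Plain-Python product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inv2(m):
    """Inverse of a 2 x 2 matrix; raises if it is singular."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular 2 x 2 block")
    return [[d / det, -b / det], [-c / det, a / det]]

def f_full_rank_case(A, k=2):
    """Lower q x k block of A L_0 (L_0' A L_0)^{-1}.

    Because L_0 consists of the first k columns of the identity,
    A L_0 is simply the first k columns of A.
    """
    upper = [row[:k] for row in A[:k]]   # L_0' A L_0
    lower = [row[:k] for row in A[k:]]   # remaining q rows of A L_0
    return matmul(lower, inv2(upper))

B = [[0.5, -2.0]]                        # q x (p - q)
G = [[2.0, 0.3], [0.3, 1.0]]             # positive definite, rank p - q
L_B = [[1.0, 0.0], [0.0, 1.0]] + B       # p x (p - q)
limit = matmul(matmul(L_B, G), transpose(L_B))   # L_B G L_B'
recovered = f_full_rank_case(limit)
```

Applied to a sample matrix that is close to the limit, the same function returns a value close to $B$ as long as the upper block stays invertible, which is the continuity used for consistency.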
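Here $L_0$ can now be replaced by an arbitrary matrix $C \in \mathfrak M_{p\times(p-q)}$ if only
parameter values are considered for which $r[L_B'C] = p - q$. In particular, let
$C_\kappa \in \mathfrak M_{p\times(p-q)}$, $\kappa = 1, \dots, \binom{p}{q}$, be the $\binom{p}{q}$ matrices consisting of $p - q$ columns of $I_p$ (in
their given order). Let $C_1 = L_0$. Now, in case the parameter $B$ is restricted to
the set

$$A_\kappa = \{B \in \mathfrak M_{q\times(p-q)} \mid r[L_B'C_\kappa] = p - q\},$$

then the estimators $\hat B_{(\kappa)m}$, $\kappa = 1, \dots, \binom{p}{q}$, resulting by replacing $L_0$ by $C_\kappa$
in (35) are also asymptotic $Q_m$-estimators. Indeed, since $A_\kappa$ is open, the above
derivations carry over; moreover, it can be shown that $\mu_{q(p-q)}(A_\kappa) = 1$.

We adopt the convention of inserting a g-inverse $(L_0'AC_\kappa)^{-}$ if $R(AC_\kappa)
\notin \varrho(\mathfrak M_{q\times(p-q)})$, i.e. if $(L_0'AC_\kappa)^{-1}$ does not exist. Then (35) implies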


Boom = Li QnC(LpQnC.) 


Observe that under $(V_3)$ in the case $q = p - 1$, $\hat B_{(\kappa)m}$ is just an estimator of
the form (3.3.20) (cf. Lemma 3.3.1). Here $\lambda = n(m)(m - n(m))^{-1}$, i.e. $\hat B_{(\kappa)m}$
is the 2SLS-estimator modified to consistency in the general case
$\lim_{m\to\infty} n(m)(m - n(m))^{-1} \ne 0$. Under $(V_\nu)$, $\nu = 1, 2$, the estimator $\hat B_{(\kappa)m}$ is the
analogue to this; in the general case $q < p - 1$, $\hat B_{(\kappa)m}$ is the natural generalization
(the classical 2SLS-estimator and LIML in simultaneous equation models
correspond to estimators of $B$ in LIFU* in the case $p - q = 1$; see Section
3.3.1). For $p = 2$, $q = 1$ we have $\binom{p}{q} = 2$, and in the model of Example
3.4.7 (for $r = p$) we just obtain the estimators (3.3.18), (3.3.19).

In the literature, most investigations on the efficiency of estimators in
linear functional relationships (LIFU*) have centred on the comparison of
$\hat B_{Cm}$ and $\hat B_{(\kappa)m}$.

The usual adaptation of this estimation method to the case $q < r < p$
consists in replacing $R(Q_mC_\kappa)$ in (35) by


Ont! (I+ + IRC), 


where C% € Mt, (rq) consists of r — q columns of J,. According to Section 3.4.1 
(iii) and Lemma 3.3.2, as well as [A 1.5], the matrix Ff is a function of O,,; 
the resulting estimator
os 1 4 , 
ple (QnF gi (I+ + TRC%))) if Ong! (I++ TRO2) € OMe <cp-w) 
1(On) otherwise 


is thus a function of $Q_m$. Using Section 3.4.1 (iii) and [A 1.5] it is easily shown
that $\hat B_{(\kappa)m}$ is also an asymptotic $Q_m$-estimator if the parameter $B_1$ is restricted,
in accordance with Section 3.3.4 (ii), to the set

$$A_\kappa = \{B_1 \in \mathfrak M_{q\times(r-q)} \mid r[L_{B_1}'C^*_\kappa] = r - q\}.$$
Analogously to Remark 3.3.10 it can be shown that 
Bum = (Biuom ey rt Biiainn 8 tgs (p-tyole Bachem Mg x (ra) 
Bocoym = Lig Qi.C2( Lin Qn. x) 
Byjm = Ly, ZamM in = Ls ati 


(Lo, Lz, according to (xiv)). 


Example 3.4.11 Consider the case $r = p = 2$, $q = 1$. Let $\Gamma_2$ be determined
by (cf. [A 1.10]) $\Gamma_2 = \operatorname{Diag}[1, 1, 1/\sqrt 2, 1]$. Let $f: \mathbb R^3 \to \mathbb R^1$,

$$f(x) = \begin{cases} (x_3/x_1)^{1/2}\operatorname{sgn} x_2 & \text{if } x = (x_j)_{j=1,2,3},\ x_1 \ne 0, \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$f(\Gamma_2'L_BGL_B') = B \qquad \forall (B, G) \in \mathbb R^2,\ G \ne 0,$$

i.e. $f$ fulfils (19). Furthermore, for each point from $A \setminus \{x \mid x = (x_j)_{j=1,2,3},\ x_2 = 0\}$
there exists an open neighbourhood on which $f$ is continuously differentiable.
If the parameter value $B = 0$ is excluded we obtain, similarly to
the above, that the estimator

$$\hat B_{Tm} = (q_{m22}/q_{m11})^{1/2}\operatorname{sgn} q_{m12}$$

for $Q_m = ((q_{mij}))_{i,j=1,2}$ is an asymptotic $Q_m$-estimator.
The estimator $\hat B_{Tm}$ was introduced by Tukey (1951) and discussed by Madansky
(1959) and Dorff and Gurland (1961).
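For $p = 2$, $q = 1$ the estimator of this example is a one-line computation. The following is a minimal sketch, assuming $L_B = (1, B)'$ and $G > 0$, so that the limit matrix has entries $G$, $GB$ and $GB^2$; the function name and the fallback value for a negative ratio are illustrative choices, not taken from the text.

```python
import math

def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def tukey_type_estimate(q11, q12, q22):
    """Evaluate (q22 / q11)^(1/2) * sgn(q12) from the entries of a 2 x 2 matrix.

    Returns 0.0 when q11 == 0 (as in the definition of f) and also when the
    ratio is negative, which is a convention chosen for this sketch only.
    """
    if q11 == 0 or q22 / q11 < 0:
        return 0.0
    return math.sqrt(q22 / q11) * sign(q12)

# Exact limit matrix G * [[1, B], [B, B^2]] for B = -1.5, G = 2.
B_true, G = -1.5, 2.0
q11, q12, q22 = G, G * B_true, G * B_true ** 2
estimate = tukey_type_estimate(q11, q12, q22)
```

On the exact limit matrix the estimate reproduces $B$; the sign factor is what breaks down near $B = 0$, which matches the exclusion of that parameter value in the example.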


Example 3.4.12 (Minimum contrast estimators) We consider a modification
of the principle of minimum contrast estimation (see Pfanzagl, 1969) to estimate
$B$ on the basis of the convergence (18).

First let us suppose the general implicit case. Let

$$F: \mathbb R^{p(p+1)/2} \times \mathfrak L_{p,p-q} \to \mathbb R$$

be a continuous function with the property

$$\forall Q \in \mathfrak M_p:\ R(Q) = \mathcal L \in \mathfrak L_{p,p-q} \ \Longrightarrow\ F(\Gamma_p'Q, \mathcal L) < F(\Gamma_p'Q, \mathcal L') \quad \forall \mathcal L' \in \mathfrak L_{p,p-q},\ \mathcal L' \ne \mathcal L. \tag{36}$$

$F$ is called a contrast function. A minimum contrast estimator (MCE) $\hat{\mathcal L}_m$ of $\mathcal L$
is an estimator which satisfies

$$F(\Gamma_p'Q_m, \hat{\mathcal L}_m) = \inf_{\mathcal L \in \mathfrak L_{p,p-q}} F(\Gamma_p'Q_m, \mathcal L).$$

From the compactness of $\mathfrak L_{p,p-q}$ (see [A 3.1]) we infer the existence of an MCE
with (nonrestricted) values in $\mathfrak L_{p,p-q}$; in the same way, the (strong) consistency
can be shown if $Q_m \to G \in \mathfrak M^{p-q}_{p}$ (analogously to Pfanzagl, 1969; see also
3.5.4.1, Theorem 3.5.5).
In the explicit case we put 


GB ye—e | so(B)\e 


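Let $F$ be twice continuously differentiable (on an open neighbourhood of
$A \times \mathfrak M_{q\times(p-q)}$). Let

$$F'(x, B) := \partial_BF(x, B),$$

$$\partial_1F'(x, B) := \partial_xF'(x, B), \qquad \partial_2F'(x, B) := \partial_BF'(x, B),$$

and suppose

$$r[\partial_2F'(\Gamma_p'L_BGL_B', B)] = q(p - q), \qquad \forall (B, G) \in \mathfrak M_{q\times(p-q)} \times \mathfrak M_{p-q},\ r[G] = p - q. \tag{37}$$

Then, for $\hat B_m := \varrho^{-1}(\hat{\mathcal L}_m)$ if $\hat{\mathcal L}_m \in \varrho(\mathfrak M_{q\times(p-q)})$,

$$F'(\Gamma_p'Q_m, \hat B_m)' = 0_{q(p-q)}, \tag{38}$$

and because of (36),

$$F'(\Gamma_p'L_BGL_B', B)' = 0_{q(p-q)}. \tag{39}$$

From (37), (38), and (39), the consistency of $\hat B_m$, and the asymptotic normality
of $(\tilde n(m))^{1/2}(Q_m - L_BG_mL_B')$, we derive

$$\begin{aligned}
\hat B_m - B &= -(\partial_2F'(\Gamma_p'L_BGL_B', B))^{-1}\,\partial_1F'(\Gamma_p'L_BGL_B', B)\,\Gamma_p'(Q_m - L_BG_mL_B') + o_P((\tilde n(m))^{-1/2}) \\
&= \mathcal C_\vartheta\Gamma_p'(Q_m - L_BG_mL_B') + o_P((\tilde n(m))^{-1/2}).
\end{aligned} \tag{40}$$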
By differentiating (39) with respect to $B$ and $G$ we conclude, as in (22) and
(23), that $\mathcal C_\vartheta$ satisfies (b) of Definition 3.4.1. Hence, consistent MCE of the
kind described are asymptotic $Q_m$-estimators. It is easy to see that already weak
consistency is sufficient.

By this method Robinson (1977a) constructed an MCE for a model with
dependent errors (see Section 3.5.4.1). This model includes as a special case the
one considered here with $(V_1)$, $r = p$, $\mathcal W_m = \mathbb R^m$, $m \ge m_0$. For the special MCE
derived, a relation (40) is shown; from the above it follows that this is an asymptotic
$Q_m$-estimator. However, in general the estimator is not optimal within
the class.

In accordance with Remark 3.4.5 also random contrast functions depending
on $Z_m$ can be admitted if appropriate conditions are imposed on the convergence
of the function and the relevant derivatives.


Remark 3.4.6 It turns out that the theory of asymptotic Q,,-estimators is 
analogous to the Gauss-Markov theory in the linear model. Furthermore, in 
view of Example 3.4.12 and [A 2.12], we note a relation to the theory of 
minimal contrast estimation in the case of identically distributed observations. 
Indeed, the MLE is asymptotically optimal within the class of the MCE 
(Michel and Pfanzagl, 1971); the corresponding algebraic problem coincides 
with the present one. 


Remark 3.4.7 Under the specification $(V_2)$ the alternatives to $\hat B_{Cm}$ according
to Examples 3.4.10 to 3.4.12 are functions of $\hat\sigma^2_{Cm}$ and thus of $\hat B_{Cm}$ (since $Q_m$
is a function of $\hat\sigma^2_{Cm}$). The preceding results imply in this case that such a two-stage
method does not yield any improvement of the CIVE. An analogous
result can be obtained when using the alternative consistent estimators of
$\sigma^2$ of Section 3.4.2.

The results obtained suggest further asymptotically efficient estimators in
addition to the CIVE $\hat B_{Cm}$. Let $B^*_m$ be a consistent estimator of $B$ and


x 


C,, = Fg. (Ps. + Sx) Fp La... 
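As in Lemma 3.4.10 we show

$$C^*_m \xrightarrow[m\to\infty]{} C_{0\vartheta} \tag{41}$$

under the corresponding assumptions. Let the estimator $\hat B_m$ be defined by

$$\hat B_m = \begin{cases} \varrho^{-1}(R(Q_mC^*_m)) & \text{if } R(Q_mC^*_m) \in \varrho(\mathfrak M_{q\times(p-q)}), \\ \text{arbitrary} & \text{otherwise.} \end{cases}$$

Taking into account

$$P_\vartheta\bigl(R(Q_mC^*_m) \in \varrho(\mathfrak M_{q\times(p-q)})\bigr) \xrightarrow[m\to\infty]{} 1,$$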


$\hat B_m - B = L_B^{\perp\prime}(Q_m - L_BG_mL_B')\,C^*_m(L_0'Q_mC^*_m)^{-1}$ if $r[L_0'Q_mC^*_m] = p - q$, we
show as in Theorem 3.4.6 that $\hat B_m$ is an asymptotic $Q_m$-estimator. Theorem
3.4.9 (see also Theorem 3.4.8) and (41) then imply that $\hat B_m$ is asymptotically
efficient.

As the simplest consistent initial estimator we can use $B^*_m = \hat B_{(1)m}$ from
Example 3.4.10, i.e. the modified 2SLS-estimator. Then we have


KEn) = Fy! (Pas + Sn) Fp OnE pg. (I+ + TRL) 
= Fy (F* + IRS** Ot Ln); 
for the resulting estimator, denoted by $\hat B_{2m}$, we have
Bon = o7'(Fp,(I+ + RIOLSS OS L20))) 


if the right-hand side is defined. From [A 1.6] (b) and (c) we obtain, analogously 
to Remark 3.3.10, 


Bam a (Bizm t Borm)s Bins € WMaxcipry2 eae © a tee) 


Bozm = Ly QnSm 'QnLro(L20OmSm 'QnLoo)- (42) 
Bien = Lg ZomN ins (43) 


where Lyo, Lz) are defined according to (xiv). 


Corollary 3.4.3 Under the assumptions of Theorem 3.4.9 the estimator $\hat B_{2m}$
defined by (42), (43) is asymptotically efficient.

Under $(V_\nu)$, $\nu = 1, 3$, it can be shown that in (42) the g-inverse '$-$' can almost
surely be replaced by the ordinary inverse '$-1$', since the matrix in question has
full rank. Contrary to $\hat B_{Cm}$, the estimator $\hat B_{2m}$ is an elementary function of the
observations, which does not require the solution of an eigenvector problem for its
calculation.


3.4.6 The general nonnormal case 


The asymptotic distributional and optimality statements obtained in
Sections 3.4.4, 3.4.5 essentially rely on the assumption (N) of a normal distribution
of the observations. Now we investigate what results may be obtained
in the case of the general distributional assumption (3.4.1).

A crucial technical result in Sections 3.4.4, 3.4.5 was the asymptotic normality
of $(\tilde n(m))^{1/2}(Q_m - L_BG_mL_B')$; for this especially the asymptotic normality
of $m^{1/2}(Q_{Z_m,\mathcal W_m} - \mathsf E_\vartheta Q_{Z_m,\mathcal W_m})$ under $m^{-1}\dim \mathcal W_m \to \alpha \in [0, 1]$ had to
be shown. In the general nonnormal case suitable limit theorems are not available,
so we have to restrict ourselves to special cases. These are, firstly, the
cases $(V_\nu)$, $\nu = 1, 2$, $\alpha = 1$, and secondly the case $\alpha = 0$.

The case $\alpha = 1$

Here a restriction to $(V_\nu)$, $\nu = 1, 2$, is necessary because under $(V_3)$ the problem
of the limit distribution of $(m - n(m))^{1/2}(S^{(3)}_m - \mathsf E_\vartheta S^{(3)}_m)$ arises. Under the
assumptions (C4) and (C5) the asymptotic normality of $m^{1/2}(Q_m - L_BG_mL_B')$
and hence also of the asymptotic $Q_m$-estimators $\{\hat B_m\}_{m\ge m_0}$ can be shown. But
it turns out that $D_\vartheta(\{\hat B_m\}_{m\ge m_0})$ depends on the third and fourth moments of
the underlying error distribution $P^\varepsilon$. Thus the class of asymptotic $Q_m$-estimators
is no longer interesting because in general there does not exist an asymptotically
efficient element (in the sense of Definition 3.4.2). Therefore we
restrict ourselves to giving the limit distribution of $m^{1/2}(Q_m - L_BG_mL_B')$ and
$m^{1/2}(\hat B_{Cm} - B)$ for $(V_1)$. The proofs are easily obtained from the central limit
theorem (Bunke and Bunke, 1986, Theorem 2.4.3) and the form of the covariance
matrix of matrix-valued quadratic forms [A 3.19].


Lemma 3.4.16 Under $(V_1)$, (C4), (C5), $\alpha = 1$, we have
Lim'(On — Gn)} moo Noxp(9pxp, Ae) 


Ay = F, — Ed, + 2G © ©) I, + 2G © G:) + 47, (4 @ 2] T,- 


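Theorem 3.4.10 With $(V_\nu)$, (Ex), (C1), (C4), (C5), $\alpha = 1$, $\nu = 1, 2$, for the
CIVE $\{\hat B_{Cm}\}_{m\ge m_0}$ we have

$$\mathcal L\{m^{1/2}(\hat B_{Cm} - B)\} \xrightarrow[m\to\infty]{} N_{q\times(p-q)}\bigl(0_{q\times(p-q)}, D_\vartheta(\{\hat B_{Cm}\}_{m\ge m_0})\bigr),$$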


DB.) aa Cro Gel Coe = GG ®&) Col 0,5 
+ G19 © Lt’ OP Cig + G © Lz Se Lt 
with Cos from Lemma 3.4.10. 


On the basis of Lemma 3.4.16, the asymptotic normality of all asymptotic
$Q_m$-estimators under $(V_1)$ can be shown; this can also be done for $(V_2)$.

The case $\alpha = 0$

In the case of $\alpha = 0$ the term $Q_m - L_BG_mL_B'$ is by Lemma 3.4.1 a linear function
of the observations, up to $o_P(m^{-1/2})$; from this the asymptotic normality of
$m^{1/2}(Q_m - L_BG_mL_B')$ easily follows. The result with respect to the class of asymptotic
$Q_m$-estimators then corresponds to the one obtained under normal distribution
in Section 3.4.5: the difference of two elements is $o_P(m^{-1/2})$, i.e. the elements
are asymptotically equivalent. The meaning of this result will be discussed in
Section 3.4.7.

Lemma 3.4.17 Let $\alpha = 0$. If $(V_1)$, (Ex), (C5); or $(V_2)$, (Ex), (C1), (C2),
(C5); or $(V_3)$, (Ex), (C3), (C5) are satisfied, then

$$\mathcal L\{m^{1/2}(Q_m - L_BG_mL_B')\} \xrightarrow[m\to\infty]{} N_{p\times p}(0_{p\times p}, \Lambda_\vartheta),$$
A» — 40°, (LpGL; © 2) DI. 
Proof. First we consider the specification $(V_1)$. By Lemma 3.4.1, because of
$\mathsf E_\vartheta Q_m = L_BG_mL_B'$ and $O_P(m^{-1}(\operatorname{tr}[P_{\mathcal W_m}])^{1/2}) = o_P(m^{-1/2})$, we have


m'2(Qm, — LgGmL'z) = mS mPy, Min + MmPw,bm) + or(1). (44) 


Hence it holds for each K € IN, almost surely that 


m 
ml? tr [K(Om — LpGpL'g)] = 2m-12 Ye!” Py, Mi,KP 3S; + op(1), 
i=1 
(45) 
because almost surely €; € J,7 = 1. 
According to Bunke and Bunke (1986, Theorem 2.4.7), (45) is asymptotically
normal if


+0 


mM—>Co 


m =) 
(= reset ay max |JJ'KMpPy),e?2 


i 
¢=1 1<ism 


is fulfilled. With 
™ 
Oe we. |J’KM,Pye9”|l? = tr [KP KL;G,,03]| 


= + tr | Key hh, Chee 


m—-co 


and (C5), we obtain the asymptotic normality of (45) for $\operatorname{tr}[K\Sigma_\varepsilon KL_BGL_B'] > 0$
and $o_P(1)$ for (45) in the case $\operatorname{tr}[K\Sigma_\varepsilon KL_BGL_B'] = 0$, respectively. As $\Lambda_\vartheta$ only
depends on the second moment of $P^\varepsilon$, $\Lambda_\vartheta$ results from Lemma 3.4.8 for $\alpha = 0$.

Let us consider the specification $(V_2)$. First we obtain from the statement of
the theorem for $(V_1)$ and from Lemma 3.4.10 as in Lemma 3.4.11 that
$\hat B^{(1)}_{Cm} - B = O_P(m^{-1/2})$ and thence (3.4.16). For $R_m$ according to (3.4.15) we
infer from Lemma 3.4.1 that

R,, = Ry, + Op(m-"?), 

Ri, = mE nPytMn + MnPvsbn): 
Then D,R*, = O(m-) follows from (C2) so that Rj, = Op(m-1?), R,, 
= Op(m-1!?). From this, from (3.4.16), and Lemma 3.4.6 (a) it follows that 


© _ 1 .GL', = Q® — L,GML', + 0,(m-"2) (46) 


for » = 2 which proves the assertion. Furthermore 
OP = QP — nlm) m= (SP — E,) 
ee oF on n(m) (m wa n(m))-1 R,, = n(m) m* DpH a, Ls c 


As above we can conclude from (C3) (b) that R,, = Op(m-/2) and from this 
(46) for » = 3, which yields the assertion. 


From this lemma, the next theorem is an immediate result. 
Theorem 3.4.11 Let « = 0. If (V1), (Ex), (C1), (C5); or (V2), (Ex), (C1),
(C2), (C5); or (V3), (Ex), (C3), (C5) are fulfilled, then for arbitrary asymptotic
$Q_m$-estimators $\{\hat B_m\}_{m \ge m_0}$, $\{B_m^*\}_{m \ge m_0}$ we have

$\hat B_m - B_m^* = o_P(m^{-1/2}),$
EAE parce BY) ee ON cea aes Oe) 
Dig = ove aa 4 


Finally we consider the special case n(m) = n, m ≥ m0, i.e. the case of an
instrumental variable of fixed dimension (Example 3.4.3). If we assume
condition (C6) here, we immediately obtain


mL Wy ——> LyT € MP4. (47) 


For this case, Villegas (1966) on the basis of (47) defined a class of estimators 
of B, the so-called ‘ordinary estimators’. This class results from the estimators 
Bom := 0 '(Lpm) in Example 3.4.9 if W>,, is admitted there as depending 
on the observations with Wp, > W €Minxn- The equivalent version 
of (47), 


m*Y,,W., ore BmX,,W,, ats En» E, "o> Onn 


(cf. (xiv)) can be put into correspondence with a linear regression model; the 
class of ‘ordinary estimators’ can be related to the class of linear unbiased 
estimators. The asymptotic normality of the ‘ordinary estimators’ (Bene 
was proved under the general distributional specification, where D5? GB agen) 
depends only on the second moment of P*. An optimization with respect to 
DF ( (BRE? analogous to the Gauss-Markov theorem proves the asymptotic 
efficiency of the 2SLS estimator (more exactly of its system analogue; cf. 
Example 3.4.10). 

A variant of this method for general simultaneous equation models was 
given by Brundy and Jorgenson (1971). 

With the methods of Section 3.4.5 we can easily obtain an extension of this
result in the following way. On the basis of (47) a class of estimators, say
asymptotic $P_m$-estimators (for $P_m := m^{-1} Z_m W_m'$), can be defined, in the same
way as the convergence of $Q_m$ (3.4.18) led to the definition of the asymptotic
$Q_m$-estimators. Here the ‘ordinary estimators’ are a subclass; if we further
proceed as in Section 3.4.5, then the result of Villegas (1966) appears as a
consequence of the corresponding theory. According to Remark 3.4.6 the
optimality result with respect to the class of asymptotic $P_m$-estimators also
corresponds to the Gauss-Markov theorem. Furthermore, it can be shown that
the asymptotic $Q_m$-estimators are in this case efficient asymptotic
$P_m$-estimators. The formal proof is left to the reader.
3.4.7 Final remarks


Let us now consider the problem of asymptotic efficiency within the class of
consistent asymptotically normal estimators (in the sense of [A 2.8]). Let $d_\nu$
be the dimension of the parameter space for (P’, B) under (N), (Ex), according
to Example 3.4.1; then we have


$d_1 = q(p - q), \quad d_2 = q(p - q) + 1, \quad d_3 = q(p - q) + r(r + 1)/2.$


The following theorem indicates the lower bound for the covariance matrix
of asymptotically normal estimators of B for a fixed sequence of parameters
$\{\xi_i\}_{i \in \mathbb{N}}$.


Theorem 3.4.12 Let a model according to Example 3.4.1 be given by {P5 | # € O};
let O be defined by OL P® KX Max (pq) X {{Edien} for a given {Eien € RP 
fulfilling (C2), and by assumptions (V,), (N). Then for each estimator {Bm\m>m 
such that 


B,, =e aL aes {Eiiew)> m = Mo 
with 

L{mi2(Bp Ge B)} Foes NG ca is eae D?({Bm}mzm,)) 
one has 


D?({Bm}mzm,) 2 Dooo — (#a,] 
i.e. for almost all (in the sense of a,) parameter values (X;, B), where
Doge lg GL, SL: 


Proof. It suffices to demonstrate local asymptotic normality of the model for
fixed $\{\xi_i\}_{i \in \mathbb{N}}$ (see [A 2.7, 2.8]); we omit this proof here. A proof for a general
model of independent, not necessarily identically distributed observations can
be found in Ibragimov and Khasminski (1979, chap. II); compare also Philippou
and Roussas (1973), Roussas (1972), Andersen (1970b).


Hence D5>, is a lower bound for known nuisance parameters $\{\xi_i\}_{i \in \mathbb{N}}$ and thus
also a bound in the case of (partially) unknown $\{\xi_i\}_{i \in \mathbb{N}}$, i.e. in the present model.
But this bound is attained by the asymptotic Q,,-estimators, especially the 
CIVE, in special cases only. According to Remark 3.4.3 this is the case if 
« = 0 and 


H =H, (48) 


holds. In particular, (48) is fulfilled in the adequate case; (48) can be considered 
as a condition of ‘asymptotic adequacy’ of a model with (A) in case that (A) 
is not necessarily satisfied by the data. In the case of n(m) = n, m ≥ m0, there
is also an interpretation as an analogue of a condition in the model (LIFU-)
of Section 3.3.4,


Xy-w = On <p> 
i.e. to a condition of ‘full correlation’ between the (partially) nonobservable 
random variable ~ and the instrumental variable w. 

If « > 0, this lower bound (under (N)) is not attained by the asymptotic 
Qm-estimators, especially not by Bo, (cf. Theorem 3.4.9, 3.4.5). Hence (for 
the adequate case) the asymptotic efficiency of the MLE with respect to D®, 
is impaired, due to the presence of the unknown incidental parameters N>,, in 
the model, the number (of real components) of which increases indefinitely. 

The general case of a model with an indefinitely increasing number of 
unknown incidental parameters with a structural parameter to be estimated 
(see Remark 3.4.1) was investigated from the point of view of asymptotic 
efficiency of estimators by Neyman and Scott (1948). They discussed the 
possibility of impaired efficiency of the MLE in the sense indicated. The 
possible occurrence of this situation in the model according to Section 3.4.1 
is the motive for considering the class of asymptotic Qn-estimators. The theory 
of this class permits the asymptotic efficiency of the MLE (or more generally of 
the CIVE) to be established with respect to a number of alternatives. 

For the case of a finite number of parameters (Example 3.4.3 under (A))
these alternatives to the MLE are asymptotically equivalent; this corresponds
to the situation in simultaneous equations models (see Theil, 1971; Schönfeld,
1971). Moreover, for this case the standard statements about asymptotic
efficiency of the MLE (cf. [A 2.9]) are valid. The case « = 0 can be regarded
as a case of an ‘essentially finite’ number of parameters; the efficiency of the
MLE is then essentially, i.e. asymptotically, unimpaired (Neyman and Scott,
1948).

The result on the optimality of the CIVE also in the case « > 0 allows an 
interpretation in the sense of robustness or ‘asymptotic efficiency of higher 
order’. According to the explanations of Example 3.4.7 the asymptotic effi- 
ciency in the case of « > 0 can be regarded heuristically as being nearer to 
optimality for finite samples (with e.g. small allocation number per group in 
Example 3.4.7) than the asymptotic efficiency in the case « = 0. 

Concerning the comparison of the competing estimators, i.e. of the CIVE 
Bom and the modified 2SLS-estimator Bs, partially including Boss (for 
r = 2, Example 3.4.10) and Bo, (Example 3.4.11), similar results have been 
obtained in the literature. To compare Bom and Bas Anderson (1976) used 
asymptotic expansions for fixed m and certain diverging parameter values 
(see Section 3.5.2). Tukey (1951), Madansky (1959), Dorff and Gurland (1961), 
and Robertson (1974) formally calculated asymptotic variances. Moreover, the 
conclusion of Fuller (1977) on the basis of the asymptotic MSE of modified 
MLE and 2SLS-estimators (Section 3.5.3, (3.5.60)) agrees with the one obtained 
here, i.e. with the efficiency of the MLE.
Anderson (1951b) calculated under (V3), (N), (A), n(m) = n (i.e. in the
model of Example 3.4.3) the limit distribution of a parametrization of ?om 
as a matrix of eigenvectors. Malinvaud (1966) obtained a similar result for 
(V,), (N), « = 1 and from this he calculated Dg} for p = r = 2, q = 1. Schnee- 
weiss (1976) calculated the normal limit distribution of Bay under a general 
error distribution in the case of the specification (V,). Patefield (1977, 1978) 
found= Dy for (V,) 0H ly p= 3,4 = 2, gai ee ior (Vj -on ke 
r=p,q=1. Van Houwelingen and Schipper (1980) obtained a result on the 
accuracy of the normal approximation for the distribution of a parametrization 
of Lom. The calculation of Do has been performed by several authors (see
Section 3.5.1). From Sections 3.3.4 and 3.4.6 one recognizes that for IV with
fixed dimension the distinction between random and nonrandom y;, Wj, 
t = 1,..., m is not essential for the asymptotic theory. Hence the results on 
the asymptotic theory in models (LIFU-) of Section 3.3.4 can be carried over 
to the case of nonrandom p;, 7 € N. 

The asymptotically efficient estimators Bz, given in Section 3.4.4 can be
seen as resulting from an iterative improvement procedure specifically adapted
to the present model. The original method for models of independent,
identically distributed observations, the heuristic background of which is the
approximation of solutions of the likelihood equation (LeCam, 1956), cannot
be applied here because of the presence of incidental parameters that are not
consistently estimable. Some authors discuss inference on the basis of a
modified likelihood function, obtained by suitable elimination of incidental
parameters (Kalbfleisch and Sprott, 1970; Sprent, 1976; Klebanov and Melamed,
1978; Patefield, 1978). The estimator Bz,, can be interpreted as resulting from
an improvement procedure constructed on the basis of a modified likelihood
equation.

The comparisons of efficiency presented here always referred to estimators 
defined by means of the same instrumental variable. Comparisons of different 
instrumental variables as well as special methods of improvement and com- 
bination (Ware, 1972; Feldstein, 1974; Schneeweiss, 1975) remain outside the 
scope of this treatise. 


3.5 Special asymptotics 


In this section we describe some important results which could not be included
directly in the relatively closed theory for LIFU* with independent errors
given in Section 3.4. In particular, these are the information matrices for MLE
in nonlinear models with replicated observations. Furthermore, some of the
estimators already introduced in Section 3.3 are considered. We also report
results concerning the behaviour of estimators for finite sample size.

In view of the large number of available results it will only be possible to
provide a survey of the asymptotic statements in this section, often without
detailed proofs. We try to present the results in such a way that, in a
practical application of the described estimation methods, the accuracy of the
estimates obtained can be computed approximately. Moreover, some of the
approximations described in the following sometimes admit statements for
finite sample sizes in practical use as well as the construction of confidence
intervals and tests. In the present section the following problems will be
discussed, especially for eigenvalue estimators in LIFU*:


1. The distribution theory for finite samples by applying infinite-series
expansions.

2. The approximation of these distributions and their parameters for
increasing sample size or related models with changes of other model
parameters.

3. Statements on accuracy by comparing different measures of concentration
and their approximations.


Because of the difficulties with multivariate models we have been able to
present results almost exclusively for the case d, = 1. The accuracy of the
approximations can be compared either by comparison with the exact dis- 
tributions or with sufficiently exact simulations. Often these comparisons 
themselves have been obtained by simulations. The corresponding results are 
reported. With respect to the intention of the chapter the exact distribution 
theory will not be taken into account. 


3.5.1 Asymptotics under fixed experimental design


3.5.1.1 The model


In certain cases it is possible to arrange a series of measurements in such a way
that we have repeated observations over a fixed experimental design. The
MLE (cf. Sections 3.2.6-3.2.7) provides consistent and, as we will see,
asymptotically optimal estimators of the system parameter and of the experimental
design. The corresponding information matrix and its approximation from
the sample give an approximation of the accuracy of the estimation. We
consider explicit models with replicated measurements according to (3.1.49) 
but without restrictions of the form 0 = p(z), as they were still permitted in 
Definition 3.1.3. For simplicity we suppose the same number of replications 
over the single design points and we also suppose normally distributed errors. 
Hence, let 


HS yf x ea eee tS ae Race et 
&ij Mi + Si, y ) (1) 


Hi = T(E, m) = (W;3 €i; ™) 


The basic formulas are derived for a general covariance matrix 2 € Wz.
Specializations are carried out for 


2 = Diag Tn, © i) (2) 
and 
Qa Ores 


The case 2 = A @ ~ can be treated especially for LIFU* as simply as it
results from the corresponding remarks in Section 3.2.5. In nonlinear models 
this case can also be treated simply. Because the case of a known matrix A is 
practically of no importance it is left out here. Formulas concerning this are to 
be found for variable-classified data Z = [Z(),,..., Z(?,] and hence in a 


slightly different form in Dolby and Freeman (1975) for d, = 3. 


3.5.1.2 Asymptotics for maximum likelihood estimators 


Under sufficiently general assumptions the consistency of the MLE already
results from the fundamental paper of Wald (1949). The numerous possible
generalizations will not be considered in detail. We recall only the
generalizations to minimum contrast estimation (cf. Strasser, 1973) and to models
with unequal numbers of replications over the single design points.

In the proof of consistency by Wald some assumptions are made. Besides
some assumptions on the distribution of the errors, for the present case only
one single assumption on the inner structure of the model is applied (Wald,
1949, p. 596, assumption 4), namely that for different parameters the related
distribution functions of the observations shall not be identical. But this is
fulfilled for all models satisfying the assumptions of the identifiability theorem
mentioned in Section 3.1.4. If necessary the parameter region has to be slightly
restricted, as was explained following the identifiability theorem in Section
3.1.4.


Theorem 3.5.1 (Consistency theorem for MLE in models with fixed experimental 
design) In models with an equal number of observations on all design points the 
MLE is consistent if the conditions of the identifiability theorem are fulfilled and 
the distribution of the observations satisfies certain weak assumptions. 


Theorem 3.5.2 (Optimality theorem) If the $z_{(n)j}$, $j = 1, 2, \dots$, are stochastically
independent random variables with the same density $p_\psi(z)$ and if this density
fulfils certain simple regularity assumptions for the true parameter $\psi_0$, then each
consistent MLE is optimal among all asymptotically normal estimators in the
sense of minimization of the asymptotic covariance. With the notation


$z_{(m)} := (z_{(n)j})_{j = 1, \dots, m_0}, \quad m = m_0 n,$
the convergence
$\mathcal{L}\{\sqrt{m_0}\,(\hat\psi(z_{(m)}) - \psi_0)\} \to N(0, \mathrm{Inf}(\psi_0)^{-1})$ (3)
is valid for almost all $\psi_0 \in \Psi$ if $\mathrm{Inf}(\psi_0)$ is the information matrix.


A proof of the theorem where the regularity conditions are given in detail
is to be found in Nölle and Witting (1970, Theorem 2.32), for example.


3.5.1.3 The information matrix under a normal distribution 


Let (aN = N (0, D,(é)). The likelihood function for this case was given in 
Section 3.2.2. With the natural partition of the whole information matrix Inf 
into the blocks Inf, Inf;,, ..., Inf,, we get


Inf, = E, (041 dq) 
= 0,p QB, (SS) Qu 


eA (OP Cah) Cas) (Os. xan | Ont il)is (4) 
Infxe = (0| Gyr)" De(5)-4 Diag; (Za | ri); (5) 
Inf; = Diag; (I dzri)) D,($)"?, Diag; (Z| Gzr;]). (6) 


In the blocks belonging to y, we always have Inf, = 0 (0 = [&, 2]). Under the 
usual regularity assumptions on /, this results from 


E(6,1 yl) = —E(é,(2,1)) (7) 
(cf. Zacks, 1971, lemma 4.3.1) and from 
Ayal) = Goon) OnaD(S)$, oe = [6,7] (8) 


and because of E(¢) = 0. 

Now we want to give the block Inf, only for the case D(C) =I ® 2,y = x. 
But this is a standard problem from multivariate analysis for parameter 
estimation of a multivariate normal distribution (cf. Anderson, 1958): 


otugst + ott ost; r =e 8, tru 
Inf = 3 ottos; (f= Ub SS Ue (9) 
(ot)2/2; pom 8, ti W, 


if 3-1 = (a). 


3.5.1.4 Asymptotics for weighted least squares estimators 


Assume the model with the block diagonal covariance structure (2). For this
reason we do not take the class of all WLSE as a basis in the following, but
consider a slightly restricted class with similar block diagonal weighting
matrices. If we have more general assumptions on the distribution of errors
we will correspondingly have to permit more general weighting matrices. On
account of Section 3.2.2 we obtain MLE with known covariance. The weighting
matrices we are going to consider now shall be of the form:


W = Diag ((In, © Wi)), WE MZ: (10) 


The parameters of the model are $\psi = [\xi_{(n)}, \pi, \Sigma_1, \dots, \Sigma_n]$. But for the
calculation of the WLSE we only need the part


$\vartheta := [\xi_{(n)}, \pi].$ (11)


With that we obtain as the criterion for the estimator of $\vartheta$ the estimation
functional


$l(\vartheta) = m^{-1} \sum_i \sum_j \|z_{ij} - \mu_i\|^2_{W_i}, \quad m = \sum_i m_i.$ (12)

Elementary transformations yield the equivalent estimation functional


$l_1(\vartheta) = m^{-1} \sum_i m_i \|\bar z_{i\cdot} - \mu_i\|^2_{W_i}.$ (13)
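
The equivalence of (12) and (13) rests on a decomposition within each design point: the weighted sum of squares around $\mu_i$ splits into a part around the group mean, which does not depend on $\mu_i$, and $m_i$ times the weighted squared distance between the group mean and $\mu_i$. The following Python sketch checks this numerically. The function names, the nested-list data layout and the choice of symmetric weighting matrices are illustrative assumptions and do not come from the text.

```python
# Sketch (assumed helper names): criterion (12) equals criterion (13)
# plus a within-group term that is free of mu_i.

def quad_form(v, w):
    """Weighted squared norm v' W v for a vector v and a square matrix W."""
    n = len(v)
    return sum(v[r] * w[r][c] * v[c] for r in range(n) for c in range(n))


def diff(a, b):
    """Componentwise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def group_mean(rows):
    """Mean vector of the replicated observations at one design point."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def criterion_replicates(groups, mus, weights):
    """m^{-1} sum_i sum_j ||z_ij - mu_i||^2_{W_i}, cf. (12)."""
    m = sum(len(g) for g in groups)
    total = sum(quad_form(diff(z, mu), w)
                for g, mu, w in zip(groups, mus, weights) for z in g)
    return total / m


def criterion_means(groups, mus, weights):
    """m^{-1} sum_i m_i ||zbar_i - mu_i||^2_{W_i}, cf. (13)."""
    m = sum(len(g) for g in groups)
    total = sum(len(g) * quad_form(diff(group_mean(g), mu), w)
                for g, mu, w in zip(groups, mus, weights))
    return total / m


def within_term(groups, weights):
    """m^{-1} sum_i sum_j ||z_ij - zbar_i||^2_{W_i}; does not involve mu_i."""
    m = sum(len(g) for g in groups)
    total = sum(quad_form(diff(z, group_mean(g)), w)
                for g, w in zip(groups, weights) for z in g)
    return total / m
```

For any choice of the $\mu_i$, `criterion_replicates` equals `criterion_means` plus `within_term`, so both functionals have the same minimizers.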


The main simplification in this model with fixed experimental design is that
we have in fact got a regression model, which becomes obvious from the
functional dependence of the parameters $\mu_i$ on $\vartheta$:


k= Mild), o = (e(9)> a], (14) 
2 = w(P) + Gj. (15) 


With this the functions $\mu_i$ correspond exactly to the regression functions in
usual regression models, and we can apply all results known from regression
models here.

By means of the identifiability theorems it turns out that we have identi- 
fiability of the system parameter in the models covered by the contact condi- 
tions (cf. Theorem 3.1.4 in Section 3.1.4). Under certain regularity assumptions 
the consistency and asymptotic normality of the WLSE follow. Now we apply 
to the special regression model the results from Chapter 1, from which we do 
not need the more complicated extensions for infinite experimental designs, 
but the extensions for multivariate regressands. However, the corresponding 
modification of the results from Chapter 1 is a purely formal matter. 

In order to be able to transfer the results mentioned, we have
to check the regularity assumptions on the differentiability and smoothness
of the function


$\mu_i = \mu_i(\vartheta) = [\xi_i, r_i(\xi_i, \pi)],$ (16)
which can mostly be regarded as satisfied in practice. Necessary for
the proof of consistency is the compactness of the parameter set $\Theta$ for $\vartheta$. In
practical applications this mostly results from physical or other natural
bounds for the region where the state variables and structural parameters may
vary. In the present special case we thus finally have to check the
identifiability assumption, which was formulated for more general models. For this
purpose we assume that the relative frequencies of the numbers of observations
over the single design points are positive:


$\lim_{m \to \infty} m_i/m = h_i > 0, \quad i = 1, \dots, n.$ (17)


This is a sensible description of non-negligible observation frequencies over
each design point. The identifiability assumption by Jennrich (cf. Section
1.1.6.2, condition A4) now demands that


$\sum_{i=1}^{n} h_i \|\mu_i - \mu_{i0}\|^2_{W_i} = 0$ (18)
yields
$\vartheta = \vartheta_0.$ (19)
But, for $h_i > 0$,


$\mu_i = \mu_{i0}, \quad i = 1, \dots, n,$ (20)


results from (18).
Because of the special form of the $\mu_i$ in explicit models,


$\xi_i = \xi_{i0}, \quad i = 1, \dots, n,$ (21)
which yields the equations
$r_i(\xi_{i0}, \pi) = r_i(\xi_{i0}, \pi_0), \quad i = 1, \dots, n.$ (22)


Now we can apply the identifiability theorem from Section 3.1.4. According
to this, $\pi = \pi_0$ for $n > \dim \Pi$ under the conditions on the system equations,
which are mostly fulfilled in practice. If necessary one has to add an inessential
restriction of the parameter region $\Psi$, as was explained in Section 3.1.4 following the
identifiability theorem (Theorem 1.1.5 with n replaced by $m = \sum_i m_i$).


We will immediately formulate the results of Chapter 1 in the needed multi- 
variate form. 


Theorem 3.5.3 (Theorem on the asymptotic normality of WLSE in models with
fixed experimental design) For WLSE with weighting matrix
$W = \mathrm{Diag}((I_{m_i} \otimes W_i))$ we have


$\mathcal{L}\{\sqrt{m}\,(\hat\vartheta - \vartheta_0)\} \to N(0, \mathrm{D}(\vartheta_0, W)),$ (23)
where $\vartheta_0 = [\xi_{(n)0}, \pi_0]$ is the true parameter and the covariance of the limit
distribution is obtained from the following relations:


$\mathrm{D}(\hat\vartheta) = \mathrm{D}(\vartheta_0, W) = C(\mathrm{Diag}((h_i W_i)))^{-1} \cdot C(\mathrm{Diag}((h_i W_i \Sigma_{i0} W_i))) \cdot C(\mathrm{Diag}((h_i W_i)))^{-1}$ (24)
with
$C(W) := \partial_\vartheta \mu_{(n)0}' \cdot W \cdot \partial_\vartheta \mu_{(n)0},$ (25)


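where $\partial_\vartheta \mu_{(n)0}$ has a special form corresponding to our special model:


$\partial_\vartheta \mu_{(n)0} = (\partial_{\xi_{(n)}} \mu_{(n)0} \mid \partial_\pi \mu_{(n)0})$ (26)

$= \begin{pmatrix} \partial_\xi \mu_1(\xi_{10}, \pi_0) & & 0 & \partial_\pi \mu_1(\xi_{10}, \pi_0) \\ & \ddots & & \vdots \\ 0 & & \partial_\xi \mu_n(\xi_{n0}, \pi_0) & \partial_\pi \mu_n(\xi_{n0}, \pi_0) \end{pmatrix},$ (27)

$\partial_\xi \mu_i = [I_{d_\xi} \mid \partial_\xi r_i], \quad \partial_\pi \mu_i = [0_{d_\pi \times d_\xi} \mid \partial_\pi r_i].$ (28)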


3.5.1.5 Asymptotic optimality of weighted least squares estimators 


Possibly the practically most important statement with regard to the
minimization of the limit covariance matrix is


‘GLSE are the best WLSE’.
For GLSE, i.e. with $W_i = \Sigma_{i0}^{-1}$, this limit covariance matrix attains its minimal value
$\mathrm{D}(\hat\vartheta_{\mathrm{GLSE}}) = C(\mathrm{Diag}((\Sigma_{i0}^{-1} \cdot h_i)))^{-1} = C(\mathrm{Diag}((\Sigma_{i0}/h_i))^{-1})^{-1}.$ (29)
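
The statement ‘GLSE are the best WLSE’ can be illustrated in the simplest special case of one scalar parameter and scalar observations at each design point. The limit variance of a WLSE then has the sandwich form (bread)^{-1} · meat · (bread)^{-1}, and the weights $w_i = 1/\sigma_i^2$ minimize it. The sketch below is only a scalar illustration under these assumptions; the function name and the argument names (`g` for the derivatives of the regression functions, `h` for the relative frequencies, `w` for the weights, `s` for the error variances) are hypothetical.

```python
# Scalar sketch of the sandwich form of the WLSE limit variance.
# g[i]: derivative of mu_i with respect to the scalar parameter (assumed given)
# h[i]: limiting relative frequency of observations at design point i
# w[i]: weight used at design point i
# s[i]: true error variance at design point i

def wlse_limit_variance(g, h, w, s):
    """Return bread^{-1} * meat * bread^{-1} for scalar quantities."""
    bread = sum(hi * gi * gi * wi for gi, hi, wi in zip(g, h, w))
    meat = sum(hi * gi * gi * wi * wi * si
               for gi, hi, wi, si in zip(g, h, w, s))
    return meat / (bread * bread)


def gls_weights(s):
    """Weights of the generalized least squares estimator: inverse variances."""
    return [1.0 / si for si in s]
```

With `w = gls_weights(s)` the meat equals the bread, so the limit variance reduces to the inverse of the sum of `h[i] * g[i]**2 / s[i]`; by the Cauchy-Schwarz inequality no other choice of positive weights gives a smaller value.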


This optimality follows from a computation of the lower bound as in Theorem
1.1.6. But the covariances $\Sigma_{i0}$ are mostly unknown in applications. With
replicated observations they can be consistently estimated. The optimality
statement remains valid for each consistent estimator of $\Sigma_{i0}$. In particular,
WLSE with the following weightings are asymptotically optimal:

$\hat W_i^{-1} = \sum_{j=1}^{m_i} (z_{ij} - \bar z_{i\cdot})(z_{ij} - \bar z_{i\cdot})'/m_i.$ (31)


WLSE with such a weighting are two-stage estimators.
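
The first stage of such a two-stage estimator can be sketched as follows: at each design point the covariance is estimated from the replicates as in (31), and its inverse serves as the weighting matrix in the second stage. The Python sketch below treats the bivariate case so that the inversion can be written in closed form; the function names and the nested-list layout are assumptions made for illustration.

```python
# Sketch of the first stage of a two-stage WLSE (bivariate observations assumed).

def replicate_covariance(rows):
    """(1/m_i) * sum_j (z_ij - zbar_i)(z_ij - zbar_i)' for one design point."""
    k = len(rows)
    d = len(rows[0])
    mean = [sum(col) / k for col in zip(*rows)]
    cov = [[0.0] * d for _ in range(d)]
    for z in rows:
        dev = [a - b for a, b in zip(z, mean)]
        for r in range(d):
            for c in range(d):
                cov[r][c] += dev[r] * dev[c] / k
    return cov


def invert_2x2(a):
    """Inverse of a nonsingular 2x2 matrix given as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0.0:
        raise ValueError("singular covariance estimate; more replicates needed")
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def estimated_weight(rows):
    """Weighting matrix for one design point: inverse replicate covariance."""
    return invert_2x2(replicate_covariance(rows))
```

At least three replicates in general position are needed at a design point for the bivariate estimate to be nonsingular.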
For stationary errors we know an analogous statement (Hannan, 1971;
Robinson, 1972). For normally distributed errors the GLSE is also
asymptotically optimal under various other criteria; details are to be found in Chapter 1.
The optimality of the GLSE under normal distribution with known covariance
is obvious from the statements for the MLE made in the present section.
In case the number of replications at each design point is $m_0$, it also follows for
$m_0 \to \infty$ that


$\mathcal{L}\{\sqrt{m_0}\,(\hat\vartheta - \vartheta_0)\} \to N(0, \mathrm{D}(\hat\vartheta)),$ (32)


where $\mathrm{D}(\hat\vartheta)$ has essentially the same form as in the general case, except that
the weightings $h_i$ have to be omitted.


3.5.1.6 Asymptotic covariance of the estimation of the structural parameters 


In most practical applications it is useful to have the asymptotic covariance
of the estimator $\hat\pi$ of the structural parameters separately. The accuracy of
the estimated design points $\hat\mu_i$ is not so interesting since the $\mu_i$ often only play
the role of basis-points in a ‘gauge-experiment’ for the determination of the
structural parameters of actual interest.

We only consider GLSE. The formulas for the practically more important
two-stage GLSE are the same. We obtain the asymptotic covariance of $\hat\pi$
by means of the known inversion formula for block matrices:


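$\Sigma_\pi = \mathrm{D}(\hat\pi) = (D_\pi - D_{\pi\xi} D_\xi^{-1} D_{\xi\pi})^{-1},$ (33)


where the $D_\pi$, $D_\xi$, $D_{\pi\xi}$ arise from the natural partition of $\mathrm{D}(\hat\vartheta)^{-1}$ in (24), where
the weighting matrices leading to GLSE are inserted. With elementary
transformations we obtain by using (27), (28), (29)


$\Sigma_\pi^{-1} = \sum_i h_i\, \partial_\pi \mu_{i0}'\, \Sigma_{i0}^{-1/2} (I - P_{L_i})\, \Sigma_{i0}^{-1/2}\, \partial_\pi \mu_{i0}.$ (34)

$P_{L_i}$ is the orthogonal projection on the linear subspace

$L_i = R[\Sigma_{i0}^{-1/2}\, \partial_\xi \mu_{i0}],$ (35)
where $\partial_\xi \mu_{i0} = [I_{d_\xi} \mid \partial_\xi r_{i0}]$.
But now

$I - P_{L_i} = P_{L_i^\perp}, \quad L_i^\perp = R[\Sigma_{i0}^{1/2} (-\partial_\xi r_{i0} \mid I)'],$ (36)
which finally yields

$\Sigma_\pi^{-1} = \sum_{i=1}^{n} h_i\, \partial_\pi r_{i0}'\, ((-\partial_\xi r_{i0} \mid I)\, \Sigma_{i0}\, (-\partial_\xi r_{i0} \mid I)')^{-1}\, \partial_\pi r_{i0}.$ (37)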
This is a natural generalization of some known formulas that have been derived
for functional relations, i.e. in case $r_i = r_0$. The case d, = 2 is treated in Dolby
and Lipton (1972), and the case $m_i = m_0$, $\Sigma_{i0} = \Sigma_0$ in Dolby and Freeman
(1975) for d, = 3. The results derived there only resulted from the formal
investigation of MLE and the information matrix without asymptotic
distribution and optimality statements.
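
The inversion formula for block matrices used for (33) states that the block of the inverse belonging to the structural parameter equals the inverse of the Schur complement of the other diagonal block. The following Python sketch verifies this identity for a symmetric 3 × 3 matrix partitioned into a scalar block (standing for the structural parameter) and a 2 × 2 block (standing for the design points). The block sizes, the function names and the numbers are illustrative assumptions only.

```python
# Sketch: the (pi, pi) entry of M^{-1} equals 1 / (D_pi - D_pixi D_xi^{-1} D_xipi)
# for a symmetric matrix M = [[D_pi, D_pixi], [D_xipi, D_xi]] with scalar D_pi.

def schur_inverse_block(d_pi, d_pixi, d_xi):
    """Inverse Schur complement for scalar d_pi, 2-vector d_pixi, 2x2 d_xi."""
    (a, b), (c, d) = d_xi
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    t0 = inv[0][0] * d_pixi[0] + inv[0][1] * d_pixi[1]
    t1 = inv[1][0] * d_pixi[0] + inv[1][1] * d_pixi[1]
    return 1.0 / (d_pi - (d_pixi[0] * t0 + d_pixi[1] * t1))


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def top_left_of_inverse(m):
    """Entry (0, 0) of M^{-1}: cofactor of m[0][0] divided by det(M)."""
    cofactor = m[1][1] * m[2][2] - m[1][2] * m[2][1]
    return cofactor / det3(m)
```

Both routes give the same number, which is why the covariance of the structural parameter can be read off without inverting the full matrix.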

Finally we apply the results to the simple bivariate linear models. From (37)
we can obtain a bound of the covariance of $\hat\pi = [\hat\alpha, \hat\beta]$ by computing the
asymptotic covariance. The inverse of the information matrix converges for replicated
observations of a fixed experimental design to the asymptotic covariance of
$\sqrt{n}\,(\hat\pi - \pi_0)$. (This will also be valid for all sequences of experimental designs
the number of design points of which increases ‘slowly’ relative to the numbers
of replications.)

Because of Q4,)(% + B&F) = (1, &:), Or; = B, Oc5 = 0 (37) yields (cf. w(f) 
in Equation 3.2.1.1 (23) and 3.8.1 (4)) 


Me 
Fen = E90 (2 fa) 28) = as + 0s) 


We obtain a consistent estimator by replacing $\xi_i$ and $s_i(\beta)$ by $\hat\xi_i$ and $s_i(\hat\beta)$,
respectively. The covariance estimator itself we get by inverting and dividing
by n (see also Barnett, 1970, for $\sigma_\varepsilon^2 = c\sigma_\delta^2$, c known):


_ ate) 
18g a bs aos Nie a (38) 
m 


Note that this formula is formally also valid for the case
of nonreplicated observations (Barnett, 1970), but that it does not provide the
true asymptotic covariance, which Patefield (1977) gave under the conditions


lim € < 00, sion lim Dijin coz 


n—>oo N—>0o 


(see also Section 3.4) and o;¢ = co; (c known): 


Deion e clear) 
(Co Bae Bee ie 
: oye Uae) n(1 + 7) 
D°(&, P) = ae: —_ (39) 
n y Co Ei 


—1 
with 
T = Cos/(c + B?) o; 
A consistent estimator D of n - D™ is

(c + B) of ( m(1 +t) + dzy/(np) —#.(1 + 2) (40 
day —#(1 +4) 1k x ) 
with 


t=n-c-oB/(c + 82) diye 


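If $\sigma_\delta^2$ is unknown, we can use the consistent estimator $2n\hat\sigma_\delta^2/(n - 2)$, where $\hat\sigma_\delta^2$
is the MLE from this model (Kendall and Stuart, 1961, (29.56)):


$\hat\sigma_\delta^2 = (d_{yy} - 2\hat\beta d_{xy} + \hat\beta^2 d_{xx})/2(c + \hat\beta^2).$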
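
A minimal Python sketch of this variance estimator for the bivariate linear relation with known variance ratio c is given below. Two points are assumptions that are not stated in the text: the quantities $d_{xx}$, $d_{xy}$, $d_{yy}$ are taken to be the centred second moments divided by n, and the slope is computed with the standard closed-form maximum likelihood solution for a known variance ratio (often called Deming regression). The function names are hypothetical.

```python
import math

# Sketch: ML fit of y = alpha + beta * xi with errors in both variables,
# assuming var(error in y) = c * var(error in x) with known c.

def centred_moments(x, y):
    """Means and centred second moments (divided by n; an assumed convention)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    dxx = sum((a - mx) ** 2 for a in x) / n
    dyy = sum((b - my) ** 2 for b in y) / n
    dxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, dxx, dxy, dyy


def fit_known_ratio(x, y, c):
    """Return (alpha, beta, ML variance estimate, consistent variance estimate)."""
    n = len(x)
    if n <= 2:
        raise ValueError("need more than two observations")
    mx, my, dxx, dxy, dyy = centred_moments(x, y)
    if dxy == 0.0:
        raise ValueError("slope is not determined when d_xy = 0")
    # Standard ML slope for known variance ratio c (assumption, see lead-in).
    beta = (dyy - c * dxx
            + math.sqrt((dyy - c * dxx) ** 2 + 4.0 * c * dxy ** 2)) / (2.0 * dxy)
    alpha = my - beta * mx
    # ML estimator of the error variance in x, as in the formula above.
    sigma2_ml = (dyy - 2.0 * beta * dxy + beta ** 2 * dxx) / (2.0 * (c + beta ** 2))
    # Consistent version 2 n sigma2_ml / (n - 2).
    sigma2_consistent = 2.0 * n * sigma2_ml / (n - 2)
    return alpha, beta, sigma2_ml, sigma2_consistent
```

For data lying exactly on a line the fitted slope and intercept reproduce that line and both variance estimates are zero, which is a convenient sanity check.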


3.5.2 Comparison of MLE and two-stage LSE in linear functional 
relations with nonrandom unobservable variables 


We consider the model denoted in (3.1.38) as LIFU* with linear regression
part, which turned out to be the reduced form of a linear simultaneous
equations model. We suppose normally distributed errors with the covariance
$I_m \otimes \Sigma$ in the distribution model, where $\Sigma$ is unknown. The corresponding
distribution approximations were computed by Anderson and Sawa (1973) for
the 2SLS estimation, and by Anderson (1974) for the MLE. The probabilities
of falling into certain intervals computed on this basis by Anderson (1974)
permit an asymptotic comparison of accuracy, where the calculation is based
on an experimental design which is spreading. The results for known $\Sigma$ are
compiled in Anderson (1976).

The following comparisons of accuracy for p = 2 are of fundamental
character for general multivariate LIFU*, because there is no doubt that their
qualitative statements for finite sample sizes also carry over to the
multivariate case. This is especially important because corresponding
distribution approximations for p > 2 require considerably greater
technical effort.

Comparing the accuracy, there arise some problems, which we will explain 
briefly (Anderson, 1976, p. 8). Taking the MSE criterion, the 2SLS-estimator 
has to be preferred to the MLE, since moments to the order m — 2 of the 
2SLS-estimator are finite and all higher moments are infinite (Mariano, 1972). 
With this, the 2SLS-estimator has finite variance for m = 2, but the MLE 
does not. However, the probability that the MLE will fall into an interval 
around the true parameter may be greater than that of the 2SLS-estimator, 
namely for most of the practically interesting intervals. 

To describe the approximation of these probabilities we need the parameters 


SO om ESy,vé', Ces (0, 1) 2[—8B, 1], (41) 


r= SVY_B, 1] a(—B, 1) S¥2/det [2] = xé,/det [I], (42)
x := (Boz — o45)*/det [2]. (43) 


The parameter $\tau$ is a measure for the spatial dispersion of the experimental
design in proportion to the dispersion of the error. The approximations are
carried out up to the order O(r~?). Thereby the following assumptions are 
included: 


1; We [OV Mees N= Ny + Ng; 
2. [M,J=1 and &,) +0; 
3. 2/(m — n) os is a bounded sequence (Anderson, 1977, p. 512). 
Then (Anderson, 1974, 4.27), 
Py := P(\8y — | S 26,/z) = Oe) — O(—2) 
+ 1-1(—(n, — 1) a + (2% — 1) x8 — x25) D(x) + O(r-?). (44) 


It is obvious that for x = 0 also Py. = 0 + O(r-?), and £ is the median of Bu 
up to the order O(t~?). 
For the 2SLS-estimator the probabilities (Anderson, 1974, (5.4)) are


Prs := P(|xs — B| S 26,/ x) = G(x) — O(—2) 
+ HH(n, — 1) — (n, — 1)? x) x 


+ (2n,x — 1) x — xx*} O(x) + O(r-*). (45) 
This yields as the difference of the two probabilities: 
A= Py — Pos 
= P(x) (n, — 1) x(((my =i? — 2ncc*) |x) + O(r-2). (46) 
For ; 
% S 2(n, — 1) (47) 


we have Δ < 0 for all x, and the 2SLS-estimator is better. If (47) does not hold, then Δ < 0
only if 


x < ((m, — 1)/2 — 1x)? < ((m, — 19/2)”. (48) 
This yields the following result. 
Theorem 3.5.4 Under normal distribution the 2SLS-estimator for x << 2/(n, - 1)
is uniformly better than the MLE, and otherwise over intervals (-x, x) with
ax <((n, - 1)/2)"?, up to remainders of the order O(t-®).


A corresponding statement follows from the comparison of the MSE by
asymptotic expansions approximating the true distribution (Anderson, 1974,
p. 572). The distribution for By for a known $\Sigma$ was given by Mariano (1973).
The asymptotic expansions are the same up to the order O(r-*/?) under a
certain assumption. With unknown $\Sigma$, the Wishart-distributed estimator
Szw with m - n degrees of freedom is used instead of the known matrix $\Sigma$.
According to Anderson (1977) this yields, under assumption 3, the stated
equality of the asymptotic expansions.

Approximations of the distribution of the estimator $\hat\vartheta = \tan^{-1} \hat\beta$ of the
angle $\vartheta = \tan^{-1} \beta$ were developed in the same way; for $\Sigma = \sigma^2 I$ the
distribution of $\hat\vartheta_M$ does not depend on $\vartheta$ (Anderson, 1976, p. 4). The same is
probably also true in general multivariate LIFU* for the Eulerian angles of the
subspace = S,.


For LIFU* without regression part, Patefield (1976) investigated the
validity of the approximations for the resulting $\vartheta$-estimators corresponding to
(44) and (45) in a simulation study. It turned out that the approximation is
approximately valid in the region m < (2/3) (r + thes where r has the follow- 
ing form under this distribution model: 


t = (1 + fF) Seo. (49) 


The asymptotic expansions considered by Anderson (1976) were based on
fixed m, $\beta$ and $\tau \to \infty$. Another asymptotic expansion is obtained for $\tau \to \infty$
with fixed $\tau/m$ and $\beta$. Then (Patefield, 1976, p. 46),


$\tau^{1/2} (\hat\vartheta_M - \vartheta) \sim N(0, 1 + (m - 1)/\tau)$ (50)


results. It seems that this approximation is valid in a somewhat greater region
of $\tau$ and m (Patefield, 1976, p. 56).
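
To see how the variance factor 1 + (m - 1)/τ of approximation (50) affects interval probabilities, one can evaluate the probability that a centred normal variable with this variance falls into a symmetric interval. The Python sketch below does only that; it assumes the estimator has already been standardized as in (50), and the function name and the sample values of m and τ are hypothetical.

```python
import math

# Sketch: P(|Z| <= x) for Z ~ N(0, 1 + (m - 1) / tau), the limit law in (50).

def interval_probability(x, m, tau):
    """Probability that a N(0, 1 + (m - 1)/tau) variable lies in [-x, x]."""
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    variance = 1.0 + (m - 1) / tau
    return math.erf(x / math.sqrt(2.0 * variance))
```

For fixed x the probability decreases as m grows with τ held fixed, and it approaches the standard normal value as τ tends to infinity with m fixed, which corresponds to the expansion with fixed m.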

It is much more difficult to get approximations of the distribution for p > 2.
First, in Sugiura (1976, theorem 1.5) we can find such a distribution of the
smallest eigenvector for the case q = 1, where the normalization used there
corresponds with that of Theorem 3.2.9. But one has to make some formal
adaptations in order to be in a position to transfer the results to the MLE
or even to consider general eigenvalue estimators. For q > 1 the joint
distribution of several eigenvectors has to be approximated.

Statements on the relations between the asymptotic expansions explained
here and the approach with small errors, i.e. $\sigma \to 0$, which was studied by
Kadane (1970, 1971), are to be found in Anderson (1977).


3.5.3 Comparison of modified MLE and two-stage LSE 
in linear functional relations 
with nonrandom unobservable variables 


In the following we consider the same models as in Section 3.5.2, i.e.
LIFU* with linear regression part. Now the models with p > 2 are included,
while the assumption q = 1 is kept. We present the results of Fuller (1977) on
the approximate comparison of accuracy for the modified MLE and the
2SLS-estimators developed there. The modified MLE and 2SLS-estimators were
introduced in (3.3.28) and (3.3.29), respectively. Now we suppose the sequen- 
tial model to be defined by a series of matrices W(,), m = 1, 2,..., where 
only n, is fixed and n, may increase: 


1(Wm)) =n=m(m) +n, m= 1,2,... (51) 
For the series of parameters M, = M,(n,) = [, 7], € € Mip-g)xn, We assume 
lim S3/n; = D; € M>_y. (52) 


To simplify the derivation we suppose as usual (3.2.81), 


OY’ == or Py, = Py + Py: (53) 
as well as 
UU jit. (54) 


which is achieved by the transformation following (3.2.81) 
& = ((00")" Ym) 0, 
(55) 
M, := M00’ y2/Vm. 


Thus Fuller (1977) was able to show that the kth moments (k = 1, 2, ...) of
the modified estimators are bounded for all m > m(k) and that for the
estimation of the parameter $b \in \mathbb{R}^{p-1}$:


(« — 1) 
m 


E(Bypg — 6) = Sz'(Zse} Xs) (1, —b] + O(m-*), (56) 


E(bym — 6) = ee S7"(2Zse | 25) [1, —b] + O(n’). (57) 


Here S; * — 0,(n;"). With 
Zz = (—0',1)2[—6, 1], 2 es 2 (—8’, 1) [251 Ze] 
we get with MSE (b.) := H(b — 6) (b. — 6)’ if Eb. = b. 
MSE(byu) = m-1Sz' 2; + m-2Sz' LZ; tr (Sz*Es) 
— m-?2(% — 1) Sp LzSz' Zse 
+ m-*(n, — (p — 1) — 2(« — 1)) S712,S712; 
+ m-*((2 — «)® — m, — p + 1) S7125,2 5877 
+ O(m-8). (58) 
If the transformations resulting in (53)–(55) are taken into account, we have
to write this equation in the variables of the original model. However, one
also has to transform the estimator. As one may conjecture from the
equivariance of the WLSE (cf. Section 3.2.3.3), the 2SLS and the LIML estimators
will also be equivariant. Therefore we may assume that both sides of (58) then
undergo the same quadratic transformation corresponding to U C20:
Furthermore,


MSE(by2s) — MSE(6ym) 
— 2m-*(n, =p, +L 1) S712 52'¢587* + O(m-) ‘ (59) 


Moreover, for « = 4 we obtain estimators whose MSE is uniformly smaller up 
to the order O(m-*) than for each smaller «, since the MSE depends on « by the 
term 


—m~*2aSe*(LseS7*Lz5L aims = 272355") 
-m-*(a? — 4x) Sz1Zy¢Z 7557? + O(m-8). (60) 
Because of 


SPL S12, — Sz1Z5,Zz9S 
and 
SPL zgS7 Leg — Sz! X52 XegS7" 


are positive definite, we obtain the statement. 

These results explain the somewhat poor accuracy of the MLE compared
with the 2SLS-estimator in several Monte Carlo studies. If we neglect for the
moment the modifications that generate finite moments, then the 2SLS-estimator
is a k-class estimator with $\alpha = n_1 - p + q$, while the MLE results for $\alpha = 0$.
Thus we can in general expect a greater accuracy of the 2SLS-estimator for LIFU*
with linear regression part. For the same value of $\alpha$ one will have to
prefer the modified MLE because of (60).


3.5.4 Asymptotics under dependent errors in linear functional relations
3.5.4.1 Minimum contrast estimation 


The minimum contrast estimator developed by Robinson (1977) is
studied in this section (cf. Section 3.3.2). Only the consistency of the estimator
is proved in detail; the asymptotic normality and further aspects are
not pursued. As an immediate consequence of a law of large numbers,
the inconsistency of the OLSE also follows under very general assumptions
if we do not have a regression model.
Provided that the assumptions A1, A2 from Section 3.3.2 are satisfied,


$$\lim_{n\to\infty} S_Z/n = \begin{pmatrix} T_0 + \Sigma_\delta & T_0 B' \\ B T_0 & B T_0 B' + \Sigma_\varepsilon \end{pmatrix} = S(\psi) \qquad (61)$$

is true almost surely. As a typical example we consider
m(Sx — 2,) = (Sz— DY Ti) + Sep + Sez + (Sz — %5,) 
t=1 


(Lies Pe): (62) 


Th. 


i 


By (3.3.47) the last term converges to zero. 

Assumption Al also yields that the other terms are sequences of so-called 
martingale-differences which have uniformly bounded (c/2)-moments because 
of A2. For example, 


E(§ 6; | B;) = E(§,E(d; | Bs,)| Bi) = 0 (63) 
and 
Bd §;\\"!? S (Ello ° E\\§ I°)/?. (64) 


Consequently, the terms in (62) are of the order 
O,(n@lO-1 (log n)**l¢(log log n)2/*) 


(Robinson, 1977a, theorem 1). Under the assumptions A3 and $\Sigma_\delta \neq 0$ we get
for the OLSE:

$$S_{YX} S_X^{-1} \to B T_0 (T_0 + \Sigma_\delta)^{-1} \neq B. \qquad (65)$$

Conclusion. Under very general assumptions the OLSE is consistent if and only
if the $\xi_i$ are regressors in a multivariate linear regression model. This holds both
for random and nonrandom experimental design.
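
The limit in (65) can be illustrated numerically. The following is only a minimal sketch for the scalar case with a random design, assuming normal variables and using hypothetical names (`simulate_ols_slope`, `ols_limit`); `t0` stands for the variance of the unobservable $\xi_i$ and `var_delta` for the error variance of the observed regressor. None of these names or distributional choices come from the text.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least squares slope of y on x (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate_ols_slope(n, beta, t0, var_delta, var_eps, seed=0):
    """Simulate x = xi + delta, y = beta * xi + eps and return the OLS slope.

    Assumption of this sketch: xi, delta, eps are independent normal
    variables with variances t0, var_delta, var_eps.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        xi = rng.gauss(0.0, t0 ** 0.5)
        xs.append(xi + rng.gauss(0.0, var_delta ** 0.5))
        ys.append(beta * xi + rng.gauss(0.0, var_eps ** 0.5))
    return ols_slope(xs, ys)


def ols_limit(beta, t0, var_delta):
    """Scalar version of the limit B T_0 (T_0 + Sigma_delta)^{-1} in (65)."""
    return beta * t0 / (t0 + var_delta)


b_with_errors = simulate_ols_slope(50000, 2.0, 1.0, 1.0, 0.25)
b_without_errors = simulate_ols_slope(50000, 2.0, 1.0, 0.0, 0.25)
```

With error-free regressors (`var_delta = 0`) the simulated slope stays close to the true value, whereas with errors in the regressor it settles near the attenuated limit returned by `ols_limit`.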

Under a uniqueness assumption we can show the consistency. 
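Assumption A5 Let $A_0 = A(\psi_0)$, $\Omega_0 = \Omega(\psi_0)$, $h(\psi_0) = h_0$. Then the equations

$$A(\psi) = A_0, \qquad \Omega(\psi) = \Omega_0, \qquad h(\psi) = h_0 \qquad (66)$$

only possess the unique solution $\psi = \psi_0$ over $\psi \in \Psi$.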
Theorem 3.5.5 Under A1 to A5 the estimator $\hat\psi$ is consistent.
Proof. Because of A1 it holds uniformly for all $\psi$ almost surely that

L(y) > Up) := —log det [2] — tr [2-7 (y)] (67) 
(cf. (3.3.31) —(3.3.41)). Then 


L(y) — Uy) = h(y) + bly), dart aaregt) 
L(y) = —log det [2,Q(y)] + tr (QAy)-1) —(p — 9), (69) 
lo(y) = tr ((4o — A(y)) 2.(49 — A(y))’ Qy)). (70) 


From the representation by the eigenvalues and the inequality $\log x \le x - 1$
it follows that $l_1(\psi) \ge 0$. The equality is equivalent to $\Omega(\psi) = \Omega_0$. Likewise
$l_2(\psi) \ge 0$ is true, with equality if and only if $A(\psi) = A_0$, because the matrix
$\Sigma$ in (70) and $\Omega(\psi)^{-1}$ are positive definite.

Now the usual conclusion principle for minimum-contrast estimators will be
applied. Namely, by A5, $U(\psi_0) > U(\psi)$ is valid if $\psi \neq \psi_0$. If there were a
subsequence $\{\hat\psi_{n_m}\} \subseteq \{\hat\psi_n\}$ converging to some $\tilde\psi \neq \psi_0$, then

$$0 \le L(\hat\psi_{n_m}) - L(\psi_0) \to U(\tilde\psi) - U(\psi_0) < 0$$

would have to hold, which would be a contradiction.


3.5.4.2 Identifiability


With given values $A_0$, $\Omega_0$, $h_0$ the equations (66) can have arbitrarily many
solutions. But we can always find a sufficiently small neighbourhood of $\psi_0$
in which the solution of (66) is unique. This follows from the implicit-function
theorem provided that the necessary condition on the constant rank of the
Jacobian matrix is satisfied. This procedure can provide only local
identifiability, in the sense of the local character of the theorem on implicit functions.
In practical applications of the described estimation method we will have to
rely on the fact that we can restrict ourselves to a reasonably small neighbourhood
of the parameter, in which identifiability is also secured.

Finally, we want to mention the identifiability condition for a case that
often arises in applications. For this purpose let the matrix $T_0$ be regular,
which was not demanded until now. Furthermore, let us assume that no
dependences exist among the elements $B$, $\Sigma_\delta$, $\Sigma_\varepsilon$ in h. For local identifiability
it is necessary and sufficient that the null spaces of $\partial_B h(B_0)\, I \otimes T_0^{-1}$,
$\partial_{\Sigma_\delta} h(\Sigma_{\delta 0})$ and $\partial_{\Sigma_\varepsilon} h(\Sigma_{\varepsilon 0})\, B_0 \otimes B_0$ have a nonvoid intersection.
With this we have at least one criterion that provides, even if not global,
at least local identifiability.
3.5.5 Nonlinear models with increasing experimental design


For the Gauss-Newton iteration given in Section 3.3 that approximates the 
WLSE, the asymptotic normality can be shown under weaker assumptions 
for the distribution of the errors. The following assumptions are used: 


V1. The functions $r_i$ have continuous and uniformly bounded first and second
derivatives.
V2. Let 


ly — To = Op(m-), m= mbm (71) 
hold for the initial estimator $\pi_1$.


Furthermore we assume that we have the model with replicated observations
and as initial estimator of $\xi_{i0}$ we choose:

$$\xi_{i1} := \bar x_{i\cdot}.$$

The advantage of using $\bar x_{i\cdot}$ consists in its simplicity and in the
simplification of the proof in Theorem 3.5.6.
V3. Let the $j) be mutually independent and let 


Boin=9, Doin = X= 2h. (72) 


V4. 2, is known and regular. 
V5. lim I,,2,, = © € Mp, Le. 2, = O(i5}). 


m—>oo 


V6. 2,/m is regular for m > p and 


lim 2)o/m =: 2, € Me. (73) 


—>oo 


V7. For a x > 0 it holds for all 2, m that 
E |S imo? ** SK < 00. (74) 
V8. Let » = o(1), ie. I-? = o(m-"?2), 


The last condition may be characterized as 'weakly increasing size of the
experimental design' compared with the number of replications. The acquisition
of the initial iteration $\pi_1$, the treatment of unknown covariances of the errors and
the importance of assumption V7 will be considered at the end of this section.
The asymptotic normality is proved in several steps, in which it is shown that
the iterates of the parameters, such as $\xi_{i1}$, converge with a certain
order to the true values, such as $\xi_{i0}$, and that this also holds for $\pi_2$.
Following that, the limit theorem of Ljapunow can be applied. The proof uses
an idea developed by Fuller and Wolter (1982) for the case q = 1
and extends it, with various modifications, to more general models, especially to
q > 1.


Theorem 3.5.6 $\xi_{i1} - \xi_{i0} = O_p(l^{-1/2})$.
Proof. For arbitrary a > 0 we can choose a 6; > 0 with 
Ds} ) [bi < a. (75) 
From Tschebycheff’s inequality we obtain 
PST Sp] > p 2? - by) (76) 
< D(S{)/(bj- p- pt) <a. (77) 


Thus the theorem follows from assumption 1 by the definition of order for 
random vectors. 


Theorem 3.5.7 Extending Theorem 3.5.6 we obtain 


Gi, — fig = 5, + Op(m-¥2) (78) 
with 

J, = (Zi Orrin) Jai zr iol) ee 0:79) Fin. (79) 
Proof. Because of $\pi_1 - \pi_0 = O_p(m^{-1/2})$ by V2 the Taylor series yields

UE Ti 9; ie Bo a 

= Big — Oerio(Eir — E10) + Op(m-¥?). (80) 

Due to the boundedness of the second derivatives, 

O%iy — Opin = Op(-¥?) , (81) 
is true by Theorem 3.5.6. Thus 


Yi — Ni = Fin — O7in(Sir — Si0) — Or(-¥?) (Sir — €i0) + Op(m-?) 
(82) 


By Theorem 3.5.6 the third term is $O_p(l^{-1})$. Then it follows from (3.3.68) and
$O_p(\max(l^{-1}, m^{-1/2})) = O_p(m^{-1/2})$ because of V8 that


OprinE*(Bin — Oerio(Sir — Fi0)) + 2*(8i0 — Perio(Sia — Ein) 
+ OprinE (ae, — Fix) + Z(t, — £1) = WOp(m*”). (83) 
Now
xv, — Sis =X; — Ein — (§ — Ei) = Sing — A§;, (84) 


and we get
(AirigX*Oerig + L%0erig + Orink” + LZ) A,§; + 1- Op(m-¥?) 
== Oripd Eig + VEig + Orin dS ig + L°S in (85) 
or, equivalently, 
(I } Orig) SLL | Ogrig) 418i + 1 - Op(m-?) = (It Ggrig) 2-Sio- (86) 
Because of X-U-! = O(1) and dri9 = O(1) we obtain the assertion. 
Theorem 3.5.8 With &9 = (—Osrio | Ip-q) Sin we have Ed ei) = 0. 
Proof. The assertion immediately results from 
(I) ér9) [Arn i I]=0. Of 
Theorem 3.5.9 
Sy = Zi + 1-Op(t-?). (87) 
Proof. 
Sa = (—Ora D) L[eral] 
= (—2:ri9 + @:(rio — Tir)! I) 2[—Orio + O79 — Tun) I) 
= Sin + (Arion — Pia) 1 0) S[—Ario tT) 
+ (—€sri } Z) Z[8(rio — Tin) | 0) 
+ (8(ris — rio) (0) 2[G(ti — rio) 10]. (88) 


In Theorem 3.5.7, 01%, — Orig = Op(l-/2) was already used and because of 
Det == O( 1), 


Sele 2 ol Osliet 2). (89) 
For regular matrices A, B, 
ASE BENS Are Ase 8] Ba) (90) 


Because of (2',1)-1 = Op(1) which follows from Amin(A'BA) = Anin(B) if 
A’A =I4+ D, D =O, we have the assertion. 


Theorem 3.5.10 


Zyl = Ligh? + Op(l-¥?) (91) 


and 
Za/m = Lyo/m + Op(I-¥?2). (92) 


Proof: For the single rows 0,r\, k = 1,...,q of 0,7; we obtain 

Ont? = Oar) + (Sir — Eio)’ Oenr$ + (at, — 209)! Onan) 

+ Op(||%™%, — 19, Si — Ei0ll*). (93) 

According to Theorem 3.5.6 it follows as in (77) that 

Onfin = On? ip + Op(max {I-1?, m-V2, F-1}) = ,rig + Op(I-¥2). (94) 
With the definition of 2,,, and 2,4 in (3.3.78) we obtain the assertion. 
Theorem 3.5.11 We have 

Ei, = Eig + A, iy(% — ™%) + Op(m-"?). (95) 
Proof. We use ; 

&i1 — Td; = & + Tio — Ta — O11 dj. (96) 
With the Taylor series expansion 


Pig = Pig + 0,8 i (%o — MH) + Oia (Fin — Sir) + Or(lot9 — 1, F109 — Sill?) 


and & — §i = Six — dio, one gets (97) 


Ein = Fin + 2,0 in(% — M1) — OrinSig + Op(l-}). (98) 


With the Taylor series expansion of $\partial_\xi r_{i1}$ at $\partial_\xi r_{i0}$ and from V8 we get the
assertion.


Remark 3.5.1 If V5 and V8 are slightly weakened with 1,1 = o(m-™), then
the Taylor series expansion with inclusion of the second derivatives provides
the correction terms that are needed to prove the asymptotic normality. The
boundedness of the third derivatives then additionally has to be demanded.
Fuller and Wolter (1982) showed this for q = 1. However,
$O_p(\max(l^{-1}, m^{-1/2})) = O_p(m^{-1/2})$ then does not hold.


Theorem 3.5.12 
n ~ 
Ny — My = Lag dy (On% indi Ein) + 0p(m-1?) (99) 
= 


is true, where 


Ms — Tg = Op(m-"2), 
Proof. Multiplying (3.3.77) by $1 = (ln)^{-1}(ln)$ yields


2, — 4, = (Exim)? = Sot gE)? En (100) 


NM i=1 


where these matrix terms are stochastically bounded. According to Theorem 
3.5.11 this provides 


1 ys " > 1 
a So XY Aix (l2in)* €io + (% — 1) + mie: 0p(m-12) 
“iq (101) 


The terms calculated for 2,, &;; can be replaced by those calculated over 1, &io 
according to Theorems 3.5.9 and 3.5.10. For dzriy = (@n7 Peay... it holds for 
he) seg that 


anv!) = dnr) + (Ei, — io)! Acar + Op(max (I-1, m-¥?)), (102) 
and by Theorem 3.5.7, 

= dar) + §'Osar) + Op(max (1-1, m2) (103) 
Thus we get 


n 
ede —] , v—lz 
Me — My = 279 DY) OnTin~in Ein 
a 


+ (Zz /m)-1 = DS (An—7$5;)*= ar s! (Says Eig -+ op(m-1!2) 
N ¢=1 
+ Op(E-¥?) ¥ Op(Eio)/n + (0p(max (-4,m-12))) Op(E io) | 
v—1 i=1 
+ Y Op(-¥?) Op(Eio)/n- (104) 
i=1 


Now &j9 = O(1) Sig = Op(I-1/2) = op(m-"*) and with that the remainders are 
op(m-/2); for instance, the last but one summand is 


op(max (J-4m-4, m-!4)) = op(m-1?). 


Except for stochastically bounded terms there only occur linear combinations
of products of the components of $\tilde\delta_i$ and $\varepsilon_{i0}$ in the second term. These themselves
are again bounded functions of $\delta_{i0}$, which have expectation zero by Theorem
3.5.8. Then this term is of the order


Op(||Ciol|?) = Op(I-4) = op(m12), 
For the summand we obtain Op(m-¥?). 
Theorem 3.5.13 


£m (2, — m)} > N(O, 271). (105) 
Proof. Because of the boundedness of the derivatives of r and JJ — O(1) it
follows for

L-U2g* :— Oar ig(S yl)-2 & iol? (106) 
according to V8 that 

B\l-e8 2** < Op(1) H\P%igl2** < KOp(1) < K. (107) 
Further, 

D(-V68) = Laight (108) 
hence by V6, 


m—>co i=1 m 


lim D ( S ret | = lim Z,l-! = lim n,,Z, =limn-O(1)= 00. (109) 
m 


Thereby the limit theorem of Ljapunow (cf. Loève, 1977, p. 289; and Remark
3.5.2) yields


24 {253 Sel: KOS )e (110) 
ii Hy 
By V6 and from the moment convergence theorem (cf. Loève, 1977, p. 186):
m 
L {im SS. et} +> N(0, 55%). ' (111) 
i=1 


The left-hand side is $\sqrt{m}\,(\pi_2 - \pi_0) + o_p(1)$ according to Theorem 3.5.9, hence
(111) also yields the limit distribution of $\sqrt{m}\,(\pi_2 - \pi_0)$.


Remark 3.5.2 The relation (107) only provides a condition for the limit theo- 
rem, which in practice may be checked in applications. This convergence al- 
ready holds for (univariate) random variables #, = [-s¥, i = 1, 2,... if for 


nam 
some # > 0, € > 0, and D (> ) =,? o the relation 
i=1 


Blxe|?** =e Die|, t= 1,2,... (112) 


is true (Loève, 1977, p. 289). We will briefly demonstrate that the condition
(107) and V7 yield these general conditions of the Ljapunow theorem, because
they yield for t S x/2 


Blae,|2** = (Ela;|?)¥? (Bla,|0+92)¥2. (113) 
We choose a constant c > 0 so that Hlajc|? > 1, then it follows from (113) and 


= (E\cae,|2)/2 — (EB \car,|2+2" uate") = (E\car,|2+2* 1/2 (114) 
that 
Cr rn |a,|2* 5 = c2 HB |x; |” (pd Je hs i (115) 
With this,
r= k= 22, CLK, (116) 


for (112), where we still take into consideration $E x_i = 0$. The same limit
theorem holds for multivariate $x_i$, too, where in (112) the norm $\|x_i\|$ appears
instead of the absolute value $|x_i|$. The relation (112) is fulfilled for random
variables $x_i$ with uniformly bounded support (Loève, 1977, p. 289), that is, in
almost all practical cases.


The initial iteration $\pi_1$


It can be shown that under further suppositions the OLSE satisfies condition
V2. The OLSE $\pi_1$ minimizes


/ 


~ 


bie, 2) = — ¥ we — len ale (117) 


nN j=1 


For details see Fuller and Wolter (1982). 


Unknown and different covariances 


With different covariances over the single design points the proof remains
essentially the same. Instead of the Ljapunow limit theorem a central limit
theorem for linear forms has to be applied. Details are to be found in Höschel (1979)
for the case of equal numbers of replications over the single design points.

With unknown covariances we will also obtain the asymptotic normality 
for the estimators given in (3.3.85) under certain additional assumptions on the 
number of replications compared with the size of the experimental design. For 
this purpose the asymptotic investigations can be carried out for random weigh- 
ting matrices, as it was done in Chapter 1 for regression models. A detailed 
derivation is omitted here. 


3.6 Testing hypotheses in linear functional relations 


Modelling with LIFU faces the following problems: 


— How great is the dimension of the subspace $\mathcal{L}$ in which the design points $\mu_i$
are contained?
— May certain subspaces be left out of the model?


Statistics for testing corresponding hypotheses turn out to be closely related 
to linear hypotheses in the multivariate linear regression model. 

The present section is based on Anderson (1951a, sections 3 and 4). We start 
from the LIFU* with linear regression part according to (3.1.28), where a 
normalization as in (3.2.64) is assumed for the matrix of the linear restrictions L:
Z—=MU+M,V-+6, DO bg Os 
Pe Nee Ue OOK oe 8 ea.0y lL Ls ti Te 
Mie Monn, W:=(TiVIEMixm, M := (M,} M,). 


Here the observations Z and the matrix W (of instrumental variables) are 
known. 

Unlike the linear regression model, where W would correspond to the
regressor matrix, it will not be tested in this model whether EZ lies in a subspace
that is independent of the regressors, but which consequences result from the
special choice of L.

First we want to investigate whether the number of restrictions may
perhaps be reduced. Thus we consider the test problems:


(T1) $H: r[L^+] = q_1$; $K: r[L^+] = q_2$, $q_2 < q_1$,

(T2) $H: r[L^+] = q_1$; $K: r[L^+] < q_1$.

We investigate whether certain given restrictions hold in the model, i.e.
for fixed $L_0^+$ we test

(T3) $H: L^+ = L_0^+$; $K: L^+ \neq L_0^+$.

Finally, the problem

(T4) $H: L^+ = 0$; $K: L^+ \neq 0$


is considered, with which we test whether there exist linear restrictions in the
model at all. Although this question should strictly speaking be answered at
the beginning of each further investigation of the LIFU*, this problem is
treated after (T1)–(T3), because the respective test statistics can be derived
from the results belonging to (T1)–(T3).


3.6.1 Tests on the dimension of the subspace 


We test the hypothesis that the rank of $L^+$ is exactly $q_1$, against the alternative
that it is $q_2$, where $q_2$ is a fixed number not greater than $q_1$, and $q_1 < p$.
According to Theorem 3.2.9 the likelihood ratio criterion is, for normally distributed
observations,


i a ve 
a il (1+ a)" TI (1 + Ayes — I] (4+ A) 8 (1) 
ee eo l=q2t1 


Ay = A(Sz-wQz.v-v)> Ay Ss SA, 
From this we get as the likelihood ratio test for $q_1$ against all alternatives $q_2 < q_1$:
1 : 
—2 loga = >} log (1 + 4). (2) 
I=1 
For large n this yields approximately 
q 
n > A 
1=1 
a criterion which was suggested by Fisher (1938) and Hsu (1941a, b). Anderson
(1951a) showed the convergence

$$\mathcal{L}\{-2\log\lambda\} \to \chi^2_a, \qquad a = q_1(n_1 - p + q_1), \qquad (3)$$

in case $S_{U\cdot V}/m$ possesses a regular limit and $L^+$ is of rank $q_1$. For (T2) we
obtain as test criterion


ve 


A=] +a). ag 


l=1 


A further test based on newer methods was suggested by Fujikoshi and Veitch
(1979, p. 351).
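
The rank criterion of this subsection can be sketched in a few lines. The sketch below assumes that the eigenvalues of $S_{Z\cdot W}^{-1}Q_{Z\cdot U\cdot V}$ are already available and that the $q_1$ smallest of them enter the sum in (2); the multiplier `scale` in front of the sum is left as a free parameter because the printed factor is not legible here. The function name `rank_test_statistic` is hypothetical.

```python
import math


def rank_test_statistic(eigenvalues, q1, n1, p, scale):
    """Return (statistic, degrees of freedom) for the rank test.

    statistic = scale * sum of log(1 + lambda_l) over the q1 smallest
    eigenvalues; `scale` is an assumption of this sketch (sample-size
    factor). The degrees of freedom follow a = q1 * (n1 - p + q1).
    """
    smallest = sorted(eigenvalues)[:q1]
    statistic = scale * sum(math.log1p(lam) for lam in smallest)
    degrees_of_freedom = q1 * (n1 - p + q1)
    return statistic, degrees_of_freedom


stat, dof = rank_test_statistic([3.0, 0.1, 0.2], q1=2, n1=4, p=3, scale=10.0)
```

The statistic would then be compared with a quantile of the $\chi^2$-distribution with `dof` degrees of freedom.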


3.6.2 Tests under a given subspace 


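For fixed $L^+ \in M_{p\times q}$, according to (3.2.107) we have

$$Q(L^+) = L^{+\prime} Q_{Z\cdot U\cdot V} L^+ \qquad (5)$$
$$\phantom{Q(L^+)} = L^{+\prime}(Q_{Z\cdot W} - Q_{Z\cdot V}) L^+ = L^{+\prime} Z (P_{W'} - P_{V'}) Z' L^+. \qquad (6)$$

As is obvious from the representation of EZ (cf. Bunke and Bunke (1985),
[A 2.13]),

$$Q(L^+) \sim W_q\bigl(n_1,\; L^{+\prime}\Sigma L^+,\; L^{+\prime} M_1 U (P_{W'} - P_{V'})\bigr). \qquad (7)$$

Under the hypothesis in (T3),

$$Q_0 = Q(L_0^+) \sim W_q(n_1,\; L_0^{+\prime}\Sigma L_0^+). \qquad (8)$$

Moreover,

$$S = L_0^{+\prime} S_{Z\cdot W} L_0^+ \sim W_q(m - n,\; L_0^{+\prime}\Sigma L_0^+) \qquad (9)$$

is distributed independently of $Q_0$ because of $(P_{W'} - P_{V'})(I - P_{W'}) = 0$. Thus,
the criteria developed to test linear hypotheses (cf. Bunke and Bunke,
1986, 5.1.3, 5.1.5) can also be applied to this problem. Here we only mention
the likelihood ratio test. By (3.2.107),

$$Q_{Z\cdot U\cdot V} + S_{Z\cdot W} = S_{Z\cdot V}. \qquad (10)$$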
Under the hypothesis we then have

$$\Lambda = \det[S]\,\det[S + Q]^{-1} = \det[L_0^{+\prime} S_{Z\cdot W} L_0^+]\,\det[L_0^{+\prime} S_{Z\cdot V} L_0^+]^{-1} \sim U(q, m - n, n_1) \qquad (11)$$

(for the definition of $U(\cdot,\cdot,\cdot)$, cf. Bunke and Bunke, 1986, [A 2.24]). As
likelihood ratio test there results

$$\varphi(Z) = \begin{cases} 1 & \text{for } \Lambda < U_\alpha(q, m - n, n_1), \\ 0 & \text{otherwise.} \end{cases} \qquad (12)$$

For q = 1 and q = 2 we obtain F-statistics (cf. Bunke and Bunke,
1986, 5.1.5.B, table 5.1, where further approximations for the test statistics
(11)–(15) are also given).


            test statistic                                                  distribution

q = 1:      $T_1 = \dfrac{(m - n)(1 - \Lambda)}{n_1 \Lambda}$                          $F_{n_1,\, m-n}$              (13)

q = 2:      $T_2 = \dfrac{(m - n - 1)(1 - \sqrt{\Lambda})}{n_1 \sqrt{\Lambda}}$        $F_{2n_1,\, 2(m-n-1)}$        (14)

m ≫ n:      $-(m - n)\log\Lambda$                                           $\chi^2_{q n_1}$              (15)


The latter approximation is true if $S_{U\cdot V}/m$ has a regular limit. The test is
consistent and unbiased as in linear regression models.
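
For q = 1 the criterion (11) and the statistic (13) reduce to ratios of two quadratic forms. The following is a minimal sketch under that assumption; the helper names (`quad_form`, `lambda_and_t1`) and the representation of the matrices as nested lists are hypothetical and serve only as illustration.

```python
def quad_form(l, a):
    """Quadratic form l' A l for a vector l and a square matrix a (nested lists)."""
    size = len(l)
    return sum(l[i] * a[i][j] * l[j] for i in range(size) for j in range(size))


def lambda_and_t1(l, q_matrix, s_matrix, m, n, n1):
    """Return (Lambda, T_1) for a single restriction vector l (case q = 1).

    Lambda = l'S l / (l'S l + l'Q l) and T_1 = (m - n)(1 - Lambda) / (n_1 Lambda),
    where q_matrix plays the role of Q_{Z.U.V} and s_matrix that of S_{Z.W}.
    """
    q_val = quad_form(l, q_matrix)
    s_val = quad_form(l, s_matrix)
    lam = s_val / (s_val + q_val)
    t1 = (m - n) * (1.0 - lam) / (n1 * lam)
    return lam, t1


lam_example, t1_example = lambda_and_t1(
    [1.0, 0.0], [[2.0, 0.0], [0.0, 1.0]], [[8.0, 0.0], [0.0, 4.0]], m=20, n=4, n1=2
)
```

The value `t1_example` would be compared with a quantile of the F-distribution with $n_1$ and $m - n$ degrees of freedom.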


Known error-covariance. In this case 
O = OLR) = LYE-Oz.y.yS-WLd ~ Wi (m1, Lp'Ly) 


under the hypothesis $L^+ = L_0^+$.

This makes it possible to obtain $\chi^2$-statistics from Q by forming quadratic
forms of the kind $a'Qa$. The tests result from determining the critical
regions via $\alpha$-quantiles of the $\chi^2$-distribution.


3.6.3 Tests on the existence of a linear functional relation


We consider the problem (T4). In this case, under the hypothesis there does
not exist any vector $l^+ \in \mathbb{R}^p$ for which

$$l^{+\prime} M_1 = 0. \qquad (16)$$

This suggests the following test: the hypothesis is rejected if the corresponding
test (13) rejects for all $l^+ \in \mathbb{R}^p$, which happens if the minimum of $T_1(l^+)$ over
$l^+$ is greater than $F_{\alpha;\, n_1,\, m-n}$. But, up to the factor $(m - n)/n_1$, the minimum is
the smallest eigenvalue $\lambda_1$ of $S_{Z\cdot W}^{-1} Q_{Z\cdot U\cdot V}$. Thus we get as an $\alpha$-test:
Ny 
1 UA Emin 
p(z) = (17) 
0 otherwise 


This technique can also be applied to test $r[L^+] = q$ against $r[L^+] < q$. As
the critical region one obtains (Anderson, 1951a, 4.13)


q 
Il (i Ae 4,)} Ss WAGE m— i, M); (18) 
l=1 
where
$\lambda_1(S_{Z\cdot W}^{-1} Q_{Z\cdot U\cdot V}) \le \cdots \le \lambda_q(S_{Z\cdot W}^{-1} Q_{Z\cdot U\cdot V})$.

3.7 Confidence regions in linear functional relations 


As in Section 3.6 we consider LIFU* with a linear regression part. Till now 
the available results have not been as comprehensive as the corresponding 
results for linear regression models (cf. Bunke and Bunke, 1986, ch. 6). The 
confidence region is constructed based on the test for unknown error covariance 
by Anderson (1951a) given in Section 3.6. The respective results for known 
error covariance result according to Section 3.6.2, where details for bivariate 
LIFU* are also to be found in Kendall and Stuart (1961, 29.22). However,
only the case q = 1 is taken into consideration. For q > 1 there arise
identifiability problems in the determination of $L^+$; for further explanations see
Anderson (1951a, 5.2).


3.7.1 The case of subspaces of codimension one 


In this case we have $L^+ = l_0 \in \mathbb{R}^p$ with a normalization $l_0' A l_0 = 1$ for an
arbitrary matrix A with $r[(M_1 \mid A)] > r[M_1]$. Consequently, a confidence
region for $L^+$ consists of those vectors $l \in \mathbb{R}^p$ for which the test given in (3.6.13)
does not reject the hypothesis.

For known covariance $\Sigma$ we can achieve a simplification by using $A = \Sigma$:
namely, $l' Q_{Z\cdot U\cdot V}\, l$ and $l' S_{Z\cdot W}\, l$ are then independently $\chi^2$-distributed. In the first
case

$$\{\, l \mid l' A l = 1,\; T_1(l) \le F_{\alpha;\, n_1,\, m-n} \,\} \qquad (1)$$

results as the $(1 - \alpha)$ confidence region.
For $A = \Sigma$ there results a confidence region at the level $\alpha = (1 - \alpha_1)(1 - \alpha_2)$
as the set of those $l \in \mathbb{R}^p$ for which


VOzuvl S Nats (2) 
Neson = USz.wl = Vin (3) 


where $\alpha_1$ and $\alpha_2$ are chosen in such a way that for $l = l_0$ the significance
level is attained.


3.7.2 Consistency of confidence regions 


Among the possible confidence regions one wants to choose consistent ones.
That is, for any fixed $l_0$ and $m \to \infty$ the confidence region becomes arbitrarily
small for an arbitrarily high confidence level $1 - \alpha$.

We will demonstrate this for a special case. From inequality (2) we get

$$l'(Q/m)\, l \le \chi^2_{\alpha_1;\, n_1}/m. \qquad (4)$$

The right-hand side becomes arbitrarily small for large m:

$$Q/m = \frac{1}{m}\, Z\tilde U'(\tilde U\tilde U')^{-1}\, \tilde U\tilde U'\, (\tilde U\tilde U')^{-1}\tilde U Z'$$

with

$$\tilde U = U(I - P_{V'}), \qquad \lim \frac{1}{m}\tilde U\tilde U' \Bigl(= \lim \frac{1}{m} S_{U\cdot V}\Bigr) \in M^{>}. \qquad (5)$$

Moreover, $Z\tilde U'(\tilde U\tilde U')^{-1}$ is arbitrarily close to $M_1$, with a probability converging
to one. Provided that $l_1' M_1 \neq 0$, m can then be chosen so large that $l_1$
satisfies (4) with arbitrarily small probability.


3.7.3 Bivariate linear models 


From the results of Section 3.5.1, based on the asymptotic distributions, it is
easy to construct the related asymptotic confidence intervals, which will be
shown for the case of nonreplicated observations. For this purpose we denote
the estimator of the asymptotic covariance derived in (3.5.40) by $\hat D = \hat D(\hat\alpha, \hat\beta)$.
Then it approximately holds that

$$[\hat\alpha, \hat\beta] \sim N([\alpha, \beta],\; \hat D);$$

hence a $\gamma$-confidence region for $[\alpha, \beta]$ is given by

$$\{\, [\alpha, \beta] \mid \|[\alpha, \beta] - [\hat\alpha, \hat\beta]\|^2_{\hat D^{-1}} \le \chi^2_{\gamma;\, 2} \,\},$$

where $\chi^2_{\gamma;\,2}$ is the $\gamma$-fractile of a central $\chi^2$-distribution with two degrees of
freedom. Here $\hat D^{-1}$ can even be given explicitly as the inverse of a (2 × 2)-matrix,
but we omit this here.
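
Membership in such an elliptical region is easy to check numerically. The sketch below assumes an estimated (2 × 2) covariance matrix given as nested lists and uses the closed form of the $\chi^2$ quantile with two degrees of freedom, $-2\log(1 - \gamma)$; the function names are hypothetical.

```python
import math


def chi2_quantile_2df(gamma):
    """gamma-quantile of the central chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - gamma)


def in_confidence_region(point, estimate, cov, gamma):
    """Check whether `point` = [alpha, beta] lies in the gamma-confidence ellipse.

    The squared distance is (point - estimate)' cov^{-1} (point - estimate),
    with the inverse of the 2 x 2 matrix `cov` written out explicitly.
    """
    d0 = point[0] - estimate[0]
    d1 = point[1] - estimate[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    dist2 = (cov[1][1] * d0 * d0
             - (cov[0][1] + cov[1][0]) * d0 * d1
             + cov[0][0] * d1 * d1) / det
    return dist2 <= chi2_quantile_2df(gamma)


identity = [[1.0, 0.0], [0.0, 1.0]]
inside = in_confidence_region([2.0, 0.0], [0.0, 0.0], identity, 0.95)
outside = in_confidence_region([3.0, 0.0], [0.0, 0.0], identity, 0.95)
```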
3.8 Numerics


In this section we describe some algorithms to calculate WLSE, for which 
there are already available some results from practical applications or which 
can be simply realized by standard programs, especially for: 


— bivariate and multivariate LIFU; 
— polynomial relations; 

— explicit models; 

— implicit models. 


In principle each general method to minimize quadratic functionals with
restrictions on the parameters can be used to compute the WLSE, especially
general methods for the solution of the nonlinear normal equations. However,
with the corresponding minimization problems there are np variables in the 
experimental design and at least nq restrictions. Thus for sample sizes of about 
50, which are quite frequent in practical applications, most of the available 
general programs to solve the resulting normal equations or the corresponding 
minimization problems fail. 

However, practicable realizations of the known iteration methods, such as 
the methods of Gauss-Newton, Newton-Raphson, and others, can be derived 
from the special structure of the normal equations. This section aims at offering 
an insight into the problems of numerics in errors-in-variables models to the 
statistician and at showing some approaches to the solution of the peculiar 
large-dimensional problems arising in errors-in-variables models to numerical 
mathematicians. 

The realization in concrete computer programs of course depends on the
computing facilities available. Further, no detailed convergence conditions
for the described methods will be given, as the related special numerical
investigations would go beyond the scope of this book. The fundamental convergence
statements for the corresponding algorithms are collected in standard numerical
texts; we refer especially to the monograph by Schwetlick (1979).


3.8.1 Linear functional relations 
3.8.1.1 Bivariate linear functional relations with nonrandom unobservable variables 
The WLSE shall be calculated for the simple bivariate LIFU* 

N=O+ BE. Bout Cy, j =1,...,m; (1) 
with the covariance 


i. 0 
Doi; = Diag ( 4 (2) 
et 
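The likelihood criterion is

$$k(\psi) = \sum_{i=1}^{n} \bigl(\sigma_{\xi i}^{-1}(x_i - \xi_i)^2 + \sigma_{\eta i}^{-1}(y_i - \alpha - \beta\xi_i)^2\bigr). \qquad (3)$$

Here $\sigma_{\xi i}$ and $\sigma_{\eta i}$ denote the error variances of $x_i$ and $y_i$ from (2). The normal
equations result from the derivatives of k (Williamson, 1968;
see also Section 3.2.6):

$$\hat\alpha + \hat\beta\bar x = \bar y, \qquad (4)$$

$$\bar x = \sum_{i=1}^{n} s_i x_i \Big/ \sum_{i=1}^{n} s_i, \qquad s_i = (\sigma_{\eta i} + \beta^2\sigma_{\xi i})^{-1} = s_i(\beta),$$

and $\bar y$ analogously. Let $\tilde x_i = x_i - \bar x$ and $\tilde y_i = y_i - \bar y$; then, after eliminating
the $\xi_i$ and $\alpha$, $k(\psi) = \sum_{i=1}^{n} s_i(\beta\tilde x_i - \tilde y_i)^2$, and hence

$$0 = \tfrac{1}{2}\,\partial_\beta k = \sum_{i=1}^{n} \bigl(s_i\tilde x_i(\beta\tilde x_i - \tilde y_i) - s_i^2\beta\sigma_{\xi i}(\beta\tilde x_i - \tilde y_i)^2\bigr). \qquad (5)$$

We get the following iteration method (Williamson, 1968):

$$\beta_{k+1}\sum_{i=1}^{n} s_i t_i\tilde x_i = \sum_{i=1}^{n} s_i t_i\tilde y_i, \qquad k = 0, 1, 2, \ldots, \qquad (6)$$

$$s_i = s_i(\beta_k), \quad t_i = t_i(\beta_k) = s_i(\beta_k)(\sigma_{\eta i}\tilde x_i + \beta_k\sigma_{\xi i}\tilde y_i), \quad \tilde x_i = \tilde x_i(\beta_k), \quad \tilde y_i = \tilde y_i(\beta_k). \qquad (7)$$

The estimators for $\xi_i$ result in

$$\hat\xi_i = s_i(\hat\beta)\bigl(\sigma_{\eta i}x_i + \hat\beta\sigma_{\xi i}(y_i - \hat\alpha)\bigr). \qquad (8)$$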

With replicated observations 
Zyy = Mi t+ Cy (9) 
we obtain a WLSE, which is MLE in the normal case, with the substitutions 
25> 2%, Cai > ili; @ i= 38,0. (10) 


For unknown error variances 05;, o,; the method can be applied analogously. 
Instead of the 2;, y; we take the 2; y;, again, and 05;, o,; have to be replaced by 
the estimators wyx,/(m; — 1), wy,/(m; — 1) by Theorem 3.2.9. This estimator 
even yields the MLE in the normal case. 
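
The iteration (6)–(7) together with (4) and (8) can be sketched as follows. This is only an illustration for known error variances and nonreplicated observations; the function name `williamson_fit`, the starting value `beta0` and the stopping rule are assumptions of the sketch and are not taken from Williamson (1968).

```python
def williamson_fit(x, y, var_x, var_y, beta0=1.0, tol=1e-12, max_iter=200):
    """Iterate (6)-(7) for the slope, then apply (4) and (8).

    var_x[i], var_y[i] are the known error variances of x[i], y[i].
    Returns (alpha_hat, beta_hat, xi_hat).
    """
    def weights_and_means(beta):
        # s_i(beta) and the weighted means x_bar, y_bar
        s = [1.0 / (vy + beta * beta * vx) for vx, vy in zip(var_x, var_y)]
        total = sum(s)
        x_bar = sum(si * xi for si, xi in zip(s, x)) / total
        y_bar = sum(si * yi for si, yi in zip(s, y)) / total
        return s, x_bar, y_bar

    beta = beta0
    for _ in range(max_iter):
        s, x_bar, y_bar = weights_and_means(beta)
        num = 0.0
        den = 0.0
        for si, xi, yi, vx, vy in zip(s, x, y, var_x, var_y):
            xt = xi - x_bar
            yt = yi - y_bar
            ti = si * (vy * xt + beta * vx * yt)   # t_i from (7)
            num += si * ti * yt
            den += si * ti * xt
        beta_new = num / den                        # solve (6) for beta_{k+1}
        converged = abs(beta_new - beta) <= tol * (1.0 + abs(beta_new))
        beta = beta_new
        if converged:
            break

    s, x_bar, y_bar = weights_and_means(beta)
    alpha = y_bar - beta * x_bar                    # normal equation (4)
    xi_hat = [si * (vy * xi + beta * vx * (yi - alpha))   # estimator (8)
              for si, xi, yi, vx, vy in zip(s, x, y, var_x, var_y)]
    return alpha, beta, xi_hat


alpha_hat, beta_hat, xi_hat = williamson_fit(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0],
    [0.1, 0.2, 0.1, 0.3], [0.2, 0.1, 0.3, 0.1],
)
```

For data lying exactly on a line, as in the usage example, the iteration reproduces that line and the estimated $\xi_i$ coincide with the observed $x_i$.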


3.8.1.2 Bivariate linear functional relations with random unobservable variables 


The method to determine the MLE in the normal case, which was described 
in Section 3.2.1 (cf. Table 3.2.1), can be easily treated for the bivariate LIFU- 


by standard programs. 
3.8.1.3 Multivariate linear functional relationships with nonrandom unobservable variables


It is mainly the models with independent observations over the design points 
that are of practical interest here. The WLSE with known covariance results 
by Section 3.2.3 (Theorem 3.2.3), the MLE for unknown covariance results by 
Section 3.2.5 (Theorem 3.2.9) as WLSE with the weighting matrix Sz.y. 
Both estimators can be obtained by means of the standard program for eigen- 
value problems. 


3.8.2 Bivariate polynomial relations 
3.8.2.1 Polynomial relations 


Polynomial relations are the appropriate models for curve-fitting in many 
practical applications: 


if 


= 28 Vig 7(E A) Oe Re a et Ol, (11) 


For the observations $z_i = [x_i, y_i]$ of $[\xi_i, \eta_i]$ let the covariance matrix be
$\mathrm{Diag}(\sigma_{\xi i}, \sigma_{\eta i})$. The WLSE for these models can also be obtained with the
methods for general models. But by exploiting the special form, several
simplifications result. We introduce the method suggested by O'Neill, Sinclair,
and Smith (1969), a modified Newton-Raphson algorithm, which can of
course be extended to general models (cf. Section 3.8.3). Let

$$\eta = r(\xi, \pi) = \sum_{k=0}^{r} \pi_k p_k(\xi), \qquad \pi = [\pi_0, \ldots, \pi_r]', \qquad (12)$$

be an expansion of $r(\xi)$ in polynomials $p_k$ of degree k. By a suitable choice
of the $p_k$ we can later on achieve simplifications in the algorithm. The distance
to be minimized is

$$k(\psi) = \sum_{i=1}^{n}\bigl(\sigma_{\xi i}^{-1}(x_i - \xi_i)^2 + \sigma_{\eta i}^{-1}(y_i - r(\xi_i, \pi))^2\bigr). \qquad (13)$$

We take $\pi_1$, $\xi_{(n)1}$ as initial approximations for the true parameters $\pi_0$, $\xi_{(n)0}$.


3.8.2.2 Newton-Raphson procedures 


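We have the equations $\partial_\psi k = 0$, $\psi = [\pi, \xi_{(n)}]$, for the
stationary points of $k$. For the second iteration
$\psi_2 = \psi_1 + \Delta_2\psi$ we obtain from the Taylor series

$$\partial_\psi k_2 \approx \partial_\psi k_1 + \partial_{\psi,\psi} k_1 \, \Delta_2\psi = 0 \tag{14}$$

the following $r + n$ nonlinear equations:

$$\partial_{\psi,\psi} k_1 \, \Delta_2\psi = -\partial_\psi k_1. \tag{15}$$

Now, with $r_{i1} := r(\xi_{i1}, \pi_1)$,

$$\partial_\pi k_1 = -2 \sum_{i=1}^{n} \sigma_{yi}^{-1} (y_i - r_{i1}) \, \partial_\pi r_{i1}, \tag{16}$$

$$\partial_{\xi_i} k_1 = -2 \left( \sigma_{xi}^{-1} (x_i - \xi_{i1}) + \sigma_{yi}^{-1} (y_i - r_{i1}) \, \partial_\xi r_{i1} \right), \qquad i = 1, \ldots, n, \tag{17}$$

$$\partial_{\pi,\pi} k_1 = -2 \sum_{i=1}^{n} \sigma_{yi}^{-1} \left( -\partial_\pi r_{i1} \, \partial_\pi r_{i1}' + (y_i - r_{i1}) \, \partial_{\pi,\pi} r_{i1} \right). \tag{18}$$

It is obvious that $\partial_{\pi,\pi} r_{i1} = 0$ here. Then $\partial_{\pi,\pi} k_1$
results in this special model as the corresponding block of the information
matrix (cf. (3.5.4)). In the general case, $\partial_{\pi,\pi} k_1$ is more
complicated (cf. Section 3.8.3). If we choose the polynomials $p_k$
orthogonally with respect to the weights $\sigma_{yi}^{-1}$, we get
$\partial_{\pi,\pi} k_1$ as a diagonal matrix from (18). $\partial_{\xi,\xi} k_1$
has diagonal form, too. Moreover, we obtain the elements of
$\partial_{\pi,\xi_{(n)}} k_1$ according to

$$\partial_{\pi_l} \partial_{\xi_i} k_1 = -2 \sigma_{yi}^{-1} \, \partial_\xi \left( (y_i - r_{i1}) \, p_l(\xi_{i1}) \right). \tag{19}$$

Based on this representation, O'Neill et al. (1969) suggest the following
approximate solution of (15): first we take into consideration that the
elements of $\partial_{\pi,\xi} k_1$ contain fewer terms than those of
$\partial_{\pi,\pi} k_1$ and $\partial_{\xi,\xi} k_1$ and are typically of smaller
order. The matrix
$\mathrm{Diag}\left( (\partial_{\pi,\pi} k_1)^{-1}, (\partial_{\xi,\xi} k_1)^{-1} \right)$
may serve as the first approximation of $(\partial_{\psi,\psi} k_1)^{-1}$, and
then we get

$$\pi_2 = \pi_1 - (\partial_{\pi,\pi} k_1)^{-1} \, \partial_\pi k_1 \tag{20}$$

in the form

$$\pi_{l2} = \sum_{i=1}^{n} \sigma_{yi}^{-1} y_i \, p_l(\xi_{i1}) \Big/ \sum_{i=1}^{n} \sigma_{yi}^{-1} p_l(\xi_{i1})^2. \tag{21}$$

Similarly, we get

$$\xi_{i2} = \xi_{i1} + \frac{\sigma_{yi}^{-1} (y_i - r_{i1}) \, \partial_\xi r_{i1} + \sigma_{xi}^{-1} (x_i - \xi_{i1})}{\sigma_{yi}^{-1} (\partial_\xi r_{i1})^2 - \sigma_{yi}^{-1} (y_i - r_{i1}) \, \partial_{\xi,\xi} r_{i1} + \sigma_{xi}^{-1}}. \tag{22}$$

In (22) we can obtain better values of $\xi_{i2}$ if the new values of $\pi_2$
from (21) are taken to calculate the new values of $r_i$, $\partial_\xi r_i$,
$\partial_{\xi,\xi} r_i$.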
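The update (22) can be illustrated by a short numerical sketch. The following
Python fragment is only an illustration under stated assumptions: the function
name `xi_newton_step` is hypothetical, the polynomial is given by power-basis
coefficients (lowest degree first) rather than by the orthogonal expansion
(12), and `var_x`, `var_y` stand for the variances of the observation errors.

```python
import numpy as np

def xi_newton_step(x, y, xi, coef, var_x, var_y):
    """One Newton step of type (22) for the abscissae, coefficients held fixed.

    coef: power-basis coefficients, lowest degree first (assumption of this
    sketch; the text works with an orthogonal polynomial basis).
    """
    poly = np.polynomial.Polynomial(coef)
    r = poly(xi)                    # r_i at the current abscissae
    d1 = poly.deriv(1)(xi)          # first derivative of r with respect to xi
    d2 = poly.deriv(2)(xi)          # second derivative of r with respect to xi
    num = (y - r) * d1 / var_y + (x - xi) / var_x
    den = d1**2 / var_y - (y - r) * d2 / var_y + 1.0 / var_x
    return xi + num / den
```

For a straight line the criterion is quadratic in each abscissa, so a single
step started at the observed values already gives the exact minimizer; for
higher degrees the step has to be iterated.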


3.8.2.3 The algorithm 


As the initial approximation we choose $\xi_{(n)1} = x_{(n)}$; the orthogonal
polynomials $p_l$ we get from the relations (Forsythe, 1957):

$$p_0(\xi) = 1, \qquad p_1(\xi) = \xi - s_1, \qquad p_l(\xi) = (\xi - s_l) \, p_{l-1}(\xi) - t_l \, p_{l-2}(\xi), \tag{23}$$

where

$$s_l = \sum_{i=1}^{n} \sigma_{yi}^{-1} \xi_{i1} \, p_{l-1}(\xi_{i1})^2 \Big/ \sum_{i=1}^{n} \sigma_{yi}^{-1} p_{l-1}(\xi_{i1})^2, \tag{24}$$

$$t_l = \sum_{i=1}^{n} \sigma_{yi}^{-1} \xi_{i1} \, p_{l-1}(\xi_{i1}) \, p_{l-2}(\xi_{i1}) \Big/ \sum_{i=1}^{n} \sigma_{yi}^{-1} p_{l-2}(\xi_{i1})^2. \tag{25}$$

With $\xi_{(n)1} = x_{(n)}$, (20) then yields the polynomial that would be
fitted in the ordinary regression case. As a simplification of the algorithm
we can use the derivatives for orthogonal polynomials by Smith (1965).
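As an illustration of the three-term recurrence (23) to (25) and of the
coefficient update (21), the following Python sketch may be helpful. It is a
minimal example under assumptions: the function names `forsythe_basis` and
`coefficient_update` are hypothetical, and `w` stands for the weights
(the reciprocal variances of the ordinates).

```python
import numpy as np

def forsythe_basis(xi, w, degree):
    """Values p_0(xi_i), ..., p_degree(xi_i) of polynomials that are
    orthogonal with respect to the weights w at the points xi."""
    P = np.empty((degree + 1, xi.size))
    P[0] = 1.0
    for l in range(1, degree + 1):
        prev = P[l - 1]
        s = np.sum(w * xi * prev**2) / np.sum(w * prev**2)
        P[l] = (xi - s) * prev
        if l >= 2:
            prev2 = P[l - 2]
            t = np.sum(w * xi * prev * prev2) / np.sum(w * prev2**2)
            P[l] -= t * prev2
    return P

def coefficient_update(y, P, w):
    """Update of type (21): each coefficient is a ratio of weighted sums."""
    return (P * w * y).sum(axis=1) / (P**2 * w).sum(axis=1)
```

Because the basis is orthogonal, each coefficient is obtained separately; the
fitted values coincide with those of an ordinary weighted polynomial
regression on the current abscissae.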

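Often we need the original form (11) of the estimated polynomial, which is
obtained from the relations

$$u_{k,l} = \begin{cases} 0 & \text{for } k > l \ge 0 \text{ or } k = -1, \\ 1 & \text{for } k = l = 0, \end{cases} \tag{26}$$

$$u_{k+1,l+1} = u_{k,l} - s_{l+1} u_{k+1,l} - t_{l+1} u_{k+1,l-1} \qquad \text{for } k \le l, \tag{27}$$

$$\alpha_k = \sum_{l=k}^{r} \pi_l u_{k,l} \qquad \text{for } 0 \le k \le r, \tag{28}$$

where $u_{k,l}$ is the coefficient of $\xi^k$ in $p_l$ (with $u_{k,-1} = 0$).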
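The conversion back to the power form can be sketched as follows. This Python
fragment is an illustration only; the names `forsythe_recurrence` and
`power_coefficients` are hypothetical, and the matrix `U` holds, as an
assumption of this sketch, the coefficient of the k-th power in the l-th
orthogonal polynomial in position `U[k, l]`.

```python
import numpy as np

def forsythe_recurrence(xi, w, degree):
    """Recurrence constants s_l, t_l (index 0 unused, t_1 = 0)."""
    s = np.zeros(degree + 1)
    t = np.zeros(degree + 1)
    prev2, prev = np.zeros_like(xi), np.ones_like(xi)
    for l in range(1, degree + 1):
        s[l] = np.sum(w * xi * prev**2) / np.sum(w * prev**2)
        if l >= 2:
            t[l] = np.sum(w * xi * prev * prev2) / np.sum(w * prev2**2)
        prev2, prev = prev, (xi - s[l]) * prev - t[l] * prev2
    return s, t

def power_coefficients(pi, s, t):
    """Power-basis coefficients (lowest degree first) of sum_l pi_l p_l."""
    degree = len(pi) - 1
    U = np.zeros((degree + 1, degree + 1))
    U[0, 0] = 1.0
    for l in range(1, degree + 1):
        U[1:, l] = U[:-1, l - 1]          # multiplication by xi shifts powers
        U[:, l] -= s[l] * U[:, l - 1]
        if l >= 2:
            U[:, l] -= t[l] * U[:, l - 2]
    return U @ np.asarray(pi, dtype=float)
```

The same recurrence constants that generate the orthogonal polynomials are
reused, so no additional sums over the data are needed for the conversion.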


As with all such algorithms, the convergence depends on the initial
approximation. Unlike other numerical problems, in which there is no natural
initial approximation, we have a natural initial approximation for
$\xi_{(n)}$ here, namely $x_{(n)}$. With great probability, $x_{(n)}$ lies
sufficiently close to $\xi_{(n)0}$ to guarantee the convergence of the
algorithm to the WLSE $\hat\xi$. Nevertheless, the sequence of iterations may
converge to any stationary point of $k(\pi, \xi_{(n)})$, although the
probability of this will often be small because of the accuracy of the initial
approximation $x_{(n)}$. In applications where the WLSE has to be computed as
the global minimum, we will consequently apply special methods to calculate
all stationary points of $k$ and pick out the WLSE from those (cf. Egerton and
Laycock, 1979).


3.8.3 General models with errors-in-variables 

3.8.3.1 Conditions for an application of the procedures 

In this section we describe the known approaches to the iterative solution of
the normal equations; details have to be omitted. We use the compact notation
of (3.1.26):

$$z_{(n)} = \mu_{(n)0} + e_{(n)}, \qquad \mu_{(n)0} \in \mathbb{R}^{d_\mu}, \tag{29}$$

$$0 = s(\mu_{(n)0}, \pi_0), \qquad \pi_0 \in \mathbb{R}^{d_\pi}, \qquad s = s_{(n)} : \mathbb{R}^{d_\mu + d_\pi} \to \mathbb{R}^{c}, \tag{30}$$

with $c < d_\mu$, and for explicit models we set

$$z_{(n)} = ([x_i, y_i])_{i=1,\ldots,n}, \qquad x_{(n)} = (x_i)_{i=1,\ldots,n}, \quad \text{etc.}, \tag{31}$$

$$\mu_{(n)} = \mu_{(n)}(\xi_{(n)}, \pi) = ([\xi_i, r_i(\xi_i, \pi)])_{i=1,\ldots,n}. \tag{32}$$

Let the weighting matrix $Q^{-1}$ for the calculation of the WLSE be known and
moreover let the usual regularity assumptions for $s$ be fulfilled, such as
the regularity of the functional matrix. In case the covariance of $z$ is not
known, it is possible with replicated observations to obtain the weighting
matrix $Q^{-1}$ by means of an estimator of this covariance (cf. Section
3.2.7).

Starting from the likelihood function in the normal case,

$$k_1 = k_1(\mu, \pi, \lambda) = \tfrac{1}{2} \|z - \mu\|^2_{Q^{-1}} + \lambda' s(\mu, \pi) \tag{33}$$

results as the Lagrange function for the implicit model, and for the explicit
model

$$k_2 = k_2(\pi, \xi) = \tfrac{1}{2} \|z - \mu(\xi, \pi)\|^2_{Q^{-1}} \tag{34}$$

results as the minimization criterion.
We denote the initial approximation by $\pi_1$, $\mu_{(n)1}$ and the WLSE by
$\hat\pi$, $\hat\mu$.


3.8.3.2 Gauss-Newton procedures 


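This method is based on the linearization of $s$ in each step of the
iteration, where we abbreviate $s(\mu_{(n)1}, \pi_1) =: s_1$ and
$\mu_{(n)1} =: \mu_1$:

$$s(\mu_{(n)2}, \pi_2) \approx s_1 + \partial_\pi s_1 (\pi_2 - \pi_1) + \partial_\mu s_1 (\mu_2 - \mu_1) =: \tilde{s}(\mu_2, \pi_2). \tag{35}$$

Then we obtain the normal equations for $\pi_2$, $\mu_{(n)2}$ with the
approximated function $\tilde{s}$:

$$\begin{aligned} -Q^{-1}(z - \mu_2) + \partial_\mu s_1' \lambda_2 &= 0, \\ \partial_\pi s_1' \lambda_2 &= 0, \\ s_1 + \partial_\pi s_1 \Delta_2\pi + \partial_\mu s_1 \Delta_2\mu &= 0. \end{aligned} \tag{36}$$

This yields corrections of the initial iteration after a simple transformation
(cf. Britt and Luecke, 1973):

$$\mu_2 = z - Q \, \partial_\mu s_1' (\partial_\mu s_1 Q \, \partial_\mu s_1')^{-1} \left( s_1 + \partial_\pi s_1 (\pi_2 - \pi_1) + \partial_\mu s_1 (z - \mu_1) \right), \tag{37}$$

$$\pi_2 - \pi_1 = -\left( \partial_\pi s_1' (\partial_\mu s_1 Q \, \partial_\mu s_1')^{-1} \partial_\pi s_1 \right)^{-1} \partial_\pi s_1' (\partial_\mu s_1 Q \, \partial_\mu s_1')^{-1} \left( s_1 + \partial_\mu s_1 (z - \mu_1) \right). \tag{38}$$

Notice that $\mu_2$ does not have to lie on the manifold $S_{n,\pi_2}$.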
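One step of the corrections (37) and (38) may be sketched numerically as
follows. The Python fragment is a minimal illustration, not the authors'
program: the function name `gauss_newton_step` and the callables `s`,
`ds_dmu`, `ds_dpi` (constraint and its two Jacobians, supplied by the user)
are hypothetical names introduced for this sketch.

```python
import numpy as np

def gauss_newton_step(z, mu1, pi1, s, ds_dmu, ds_dpi, Q):
    """One step of type (37)-(38) for the implicit model s(mu, pi) = 0."""
    s1 = s(mu1, pi1)
    A = ds_dmu(mu1, pi1)                  # Jacobian of s with respect to mu
    B = ds_dpi(mu1, pi1)                  # Jacobian of s with respect to pi
    M = A @ Q @ A.T
    g = s1 + A @ (z - mu1)
    Minv_B = np.linalg.solve(M, B)
    Minv_g = np.linalg.solve(M, g)
    dpi = -np.linalg.solve(B.T @ Minv_B, B.T @ Minv_g)       # as in (38)
    mu2 = z - Q @ A.T @ np.linalg.solve(M, g + B @ dpi)      # as in (37)
    return mu2, pi1 + dpi
```

If the constraint is linear in both arguments, a single step already gives
the WLSE; for instance, for a common-mean model the step returns the weighted
mean of the observations.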
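For explicit models the iteration $\pi_2$ results from (38) in the same way,
but $\partial_\pi s$ and $\partial_\mu s$ have a simpler form than in the
general case:

$$\partial_\pi s_1 = \partial_\pi \left( -\eta_i + r_i(\xi_{i1}, \pi_1) \right)_{i=1,\ldots,n} = \left( \partial_\pi r_i(\xi_{i1}, \pi_1) \right)_{i=1,\ldots,n}, \tag{39}$$

$$\partial_\mu s_1 = \mathrm{Diag} \left( \left( \partial_\xi r_i(\xi_{i1}, \pi_1) \mid -I_q \right) \right)_{i=1,\ldots,n}. \tag{40}$$

For $q = 1$, $p = 2$, Dolby (1976b) showed that the iteration of the
independent variables $\xi_{(n)}$, which occurs in (37), can be more simply
obtained for explicit models by means of the information matrix:

$$\begin{pmatrix} \pi_2 \\ \xi_{(n)2} \end{pmatrix} = \begin{pmatrix} \pi_1 \\ \xi_{(n)1} \end{pmatrix} + \mathrm{Inf}^{-1} \left( \partial_\pi \mu_{(n)1} \mid \partial_\xi \mu_{(n)1} \right)' Q^{-1} (z - \mu_{(n)1}). \tag{41}$$

Here Inf is the information matrix at the point $(\pi_1, \xi_{(n)1})$ given in
(3.5.4) to (3.5.9) and

$$\partial_\pi \mu_{(n)1} = \left( \begin{pmatrix} 0 \\ \partial_\pi r_i(\xi_{i1}, \pi_1) \end{pmatrix} \right)_{i=1,\ldots,n}, \tag{42}$$

$$\partial_\xi \mu_{(n)1} = \mathrm{Diag} \left( \begin{pmatrix} I_{p-q} \\ \partial_\xi r_i(\xi_{i1}, \pi_1) \end{pmatrix} \right)_{i=1,\ldots,n}. \tag{43}$$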


From the way in which (41)—(43) are written, it is immediately obvious that 
the proof by Dolby (1976b) remains valid for general explicit models with 
prs: 

Hoschel and Penev (1980) showed the global convergence of a regularized 
Gauss-Newton procedure. Schwetlick and Tiller (1985, 1989) were able to reduce 
the computational effort further and improve the numerical stability. 


3.8.3.3 Simplified Gauss-Newton procedures 


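Deming (1943) suggested not to perform the linearization at
$(\pi_1, \mu_{(n)1})$ in each step of the iteration but to linearize $s$ at
the point $(\pi_1, z_{(n)})$. We do not obtain the WLSE $(\hat\pi, \hat\mu)$
even in case the initial approximation $\pi_1$, $\mu_{(n)1}$ lies in so small
a neighbourhood of $\hat\pi$, $\hat\mu$ that the complete Gauss-Newton method
would converge, but practical studies (O'Neill et al., 1969) show that this
simplification provides good estimates. These can be used as initial
approximations for the complete Gauss-Newton method if necessary. Now the
objective function to determine $\pi_2$, $\mu_{(n)2}$ is

$$\tfrac{1}{2} \|z - \mu_{(n)2}\|^2_{Q^{-1}} + \lambda' \left( s(z, \pi_1) + \partial_\pi s_1 (\pi_2 - \pi_1) + \partial_\mu s_1 (\mu_{(n)2} - z) \right), \tag{44}$$

where the derivatives $\partial_\pi s_1$, $\partial_\mu s_1$ are now evaluated
at $(\pi_1, z_{(n)})$. We get

$$\pi_2 - \pi_1 = -\left( \partial_\pi s_1' (\partial_\mu s_1 Q \, \partial_\mu s_1')^{-1} \partial_\pi s_1 \right)^{-1} \partial_\pi s_1' (\partial_\mu s_1 Q \, \partial_\mu s_1')^{-1} s(z, \pi_1), \tag{45}$$

$$\mu_2 = z - Q \, \partial_\mu s_1' (\partial_\mu s_1 Q \, \partial_\mu s_1')^{-1} \left( s(z, \pi_1) + \partial_\pi s_1 (\pi_2 - \pi_1) \right). \tag{46}$$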
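For a straight line with independent errors the simplified step (45), (46)
reduces to a small weighted least squares problem. The following Python
sketch illustrates this special case only; the function name `deming_step`
and the arguments `var_x`, `var_y` (error variances of abscissae and
ordinates, i.e. a diagonal matrix in place of the general one) are
assumptions of this example.

```python
import numpy as np

def deming_step(x, y, a1, b1, var_x, var_y):
    """Simplified step of type (45)-(46) for the line eta = a + b * xi,
    with the constraint s_i = eta_i - a - b * xi_i linearized at the data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    s = y - a1 - b1 * x                               # s(z, pi_1)
    # derivative of s_i w.r.t. (xi_i, eta_i) is (-b1, 1); with diagonal Q
    # the matrix (d_mu s Q d_mu s') is diagonal with entries m_i
    m = (b1**2 * var_x + var_y) * np.ones_like(x)
    B = np.column_stack([-np.ones_like(x), -x])       # derivative w.r.t. (a, b)
    W = 1.0 / m
    dpi = -np.linalg.solve(B.T @ (W[:, None] * B), B.T @ (W * s))   # (45)
    lam = W * (s + B @ dpi)
    xi2 = x + b1 * var_x * lam                        # (46), xi-components
    eta2 = y - var_y * lam                            # (46), eta-components
    return a1 + dpi[0], b1 + dpi[1], xi2, eta2
```

The returned points satisfy the linearized constraint exactly, but in general
not the original one, which is why the result is only an approximation of the
WLSE.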
3.8.3.4 Modified Gauss-Newton procedures


In the Gauss-Newton method the iterate $\mu_{(n)2}$ does not lie on the
manifold $S_{n,\pi_2}$. It is natural to determine $\mu_{(n)2}$ as a WLSE with
given $\pi_2$; hence we get the following iteration process:

$$\pi_1 \xrightarrow{\text{WLSE}} \mu_{(n)1} \in S_{n,\pi_1} \xrightarrow{\text{GN}} \pi_2 \xrightarrow{\text{WLSE}} \mu_{(n)2} \in S_{n,\pi_2} \xrightarrow{\text{GN}} \cdots \tag{47}$$

Let $\pi_1$ be the initial approximation of $\hat\pi$. The normal equations
for $\mu_{(n)1}$ as WLSE with respect to the observations $z_{(n)}$ on the
surface $S_{n,\pi_1}$ are

$$\begin{aligned} Q^{-1}(z - \mu_{(n)1}) + \partial_\mu s(\mu_{(n)1}, \pi_1)' \lambda_1 &= 0, \\ s(\mu_{(n)1}, \pi_1) &= 0. \end{aligned} \tag{48}$$

The Lagrange multiplier does not occur for explicit models and the normal
equation is simplified:

$$(z - \mu_{(n)1})' Q^{-1} \, \partial_\xi \mu_{(n)1} = 0. \tag{49}$$

The form of $\partial_\xi \mu_{(n)1}$ results from (43). With this we get the
method for explicit models suggested by Fuller and Wolter (1982) (cf. Sections
3.3.3 and 3.5.4). $\pi_2$ is determined by the same linearization as in the
Gauss-Newton method, i.e. $\pi_2$ results from equation (38), which contains,
in the case of repeated observations, equation (3.3.17) as a special case;
the latter was derived independently for explicit models. However, compared
with the Gauss-Newton method this method brings about an additional
difficulty. Equations (48) and (49), respectively, also demand the solution
of large-dimensional nonlinear equations. The solution of (49) can be
obtained in the case of a block-diagonal weighting matrix $Q^{-1}$, because
equation (49) then decomposes into $n$ subequations. There only the
$(p - q)$-vector $\xi_{i1}$ has to be determined (cf. (3.3.68)). Should this
be too expensive, the original Gauss-Newton method has to be chosen. However,
the asymptotic properties remain valid under the assumptions of Section 3.5.4,
because the fundamental convergence
$\hat\xi_i - \xi_{i0} = O_p(m_i^{-1/2})$ remains valid in spite of the
linearization that is carried out then.

A further possibility: the simple Gauss-Newton algorithm is carried out up to
the iteration $\pi_l$, $\mu_{(n)l}$; only in the final iteration step one
applies the modified method and obtains an estimate $\mu_{(n),l+1}$ of the
experimental design which, in contrast to the simple Gauss-Newton method,
lies on the manifold $S_{n,\pi_{l+1}}$.


3.8.3.5 Newton-Raphson procedures 


Unlike the Gauss-Newton method, the restriction s is not linearized here but 
a Taylor series expansion to second order is used for the objective function. 
Thus the method would also be applicable for the calculation of alternatives
to the WLSE, e.g. for MLE with nonnormally distributed errors. For polynomial
relations this method was explained in Section 3.8.2; for explicit models
MacDonald and Powell (1972) gave the respective algorithm, which is based 
on a special method to compute higher-order derivatives. 

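We put $\psi = [\pi, \mu, \lambda]$ for implicit models with the objective
function (33), and $\psi = [\pi, \xi]$ for explicit models by (34). As the
approximated objective function we obtain

$$k_2 = k_1 + \partial_\psi k_1' \, \Delta_2\psi + \tfrac{1}{2} \Delta_2\psi' \, \partial_{\psi,\psi} k_1 \, \Delta_2\psi \tag{50}$$

in the second iteration step. The stationary points of $k_2$ result from

$$0 = \partial_\psi k_2 = \partial_\psi k_1 + \partial_{\psi,\psi} k_1 \, \Delta_2\psi, \tag{51}$$

hence we have

$$\psi_2 = \psi_1 - (\partial_{\psi,\psi} k_1)^{-1} \, \partial_\psi k_1. \tag{52}$$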
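The iteration (52) can be sketched generically as follows. The Python
fragment is only an illustration: the function name `newton_raphson` is
hypothetical, and the gradient and the matrix of second derivatives are
assumed to be supplied by the user as callables `grad` and `hess`.

```python
import numpy as np

def newton_raphson(grad, hess, psi, steps=20, tol=1e-12):
    """Repeated application of a step of type (52)."""
    psi = np.asarray(psi, dtype=float)
    for _ in range(steps):
        delta = -np.linalg.solve(hess(psi), grad(psi))
        psi = psi + delta
        if np.linalg.norm(delta) < tol:
            break
    return psi
```

For a quadratic objective a single step gives the stationary point; in
general the result depends on the initial approximation, and the linear
system is solved directly rather than by forming the inverse.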
For explicit errors-in-variables models one obtains

$$\partial_\psi k = \zeta' Q^{-1} \, \partial_\psi \zeta, \qquad \zeta = z - \mu(\xi, \pi), \tag{53}$$

$$\partial_{\psi,\psi} k = (\partial_\psi \zeta)' Q^{-1} \, \partial_\psi \zeta + \left( (\partial_{\psi_i,\psi_j} \zeta)' Q^{-1} \zeta \right)_{i,j}. \tag{54}$$

It is obvious that the first term in (54) is just the information matrix Inf
at the point $\psi_1$ (cf. (3.5.4) to (3.5.6)), and
$\partial_\psi \zeta = -\partial_\psi \mu(\xi, \pi)$.

This shows that the Newton-Raphson method adds a correction term to the
Gauss-Newton method given in (41). But with the currently available means the
inversion of $\partial_{\psi,\psi} k_1$ in the case of medium sample sizes
seems to be possible only for a block-diagonal weighting matrix.

Provided that $\mu(\psi)$ is linear in $\pi$ and $\xi$, the correction term
vanishes in (54) and then the Gauss-Newton and the Newton-Raphson methods are
equivalent for LIFU. A comprehensive discussion of modern procedures is given
by Schwetlick and Tiller (1985, 1989).
Appendices


Linear algebra 


Let the mapping A: Nt, — IR* be given by A[A] := (A,[A], ..., 4[A]) 
where A[A] (¢ = 1, ..., &; ALA] S --- S ,[A]) are the k ordered eigen- 
values of A € M,. If Nt, is understood as a linear subspace of R*, 
then A is continuous (Kato, 1966). 


For 1 < k, let 
MY) := {4 € My | Ap aL] < April A]} 
[A 1.1] implies that NY is open in M,. 


Let &,,, be the set of the J-dimensional linear subspaces of IR*. For 
L<k and J€ & 1, J+ € Mey 1), suppose LK(J+) = J+. Then, for 
AE Mesxm, m€ IN, the relation &(A) + J = R* holds if and only 
if r[(J+)’ AJ =k —1. 
For 1 < k let the mapping 7,,,: M!!! > Q,, be given by 
Mr, (A) := {eigenspace to /,_;,,[A], ..., A,[A] of A}. 
Nx,. is continuous on IM!!! (for the topology in &,,, see [A 3.16}). 
For Ap € My; let 
VO [Zi | Ao], Ly, = [—Agi LJ. 
For l<7'= p, A € MM, p-r) let 
GOTT Ward MPO ara Fores con Wreage np eal 
JI = KS), F', := (Ly! J). Then for 
C = ((Cy))iirg € MF, Cu € MZ, we have 
R(C) + J = R?® if and only if 
ACuJ=p—r, C= Fr(J*C(J+)’ + IOnJ’) Fe 
(E = 0,0;3, Ose t= On, — Cri CC 2) Then 7[C 22] = r[C] — (p —7). 
A 1.8.


7 
In addition to the definitions of [A 1.5], let g <7, £ € &y,p-, and 
Doce io Onn Jo= Ry). For k,leN, Ae Ma let 
o(A) := AK(Ly). Then 
(a) £\+ J = RP if and only if there is a (7*, L)'e Lg XK Miia) 
with 
La BiylT ae Ie 


(This holds iff £+ = Lz(£*)+!) 
Here £* and Pys+H are uniquely determined. 

(b) £ + Jo = R? if and only if there is a B € My,.(p-g) with £ =o(B). 
Here B is uniquely determined. 

(c) £ + Jo = R? if and only if there is a Bye Myy (p+ and a 
By € Myx (rq With £ = F,(7+ + Jo(B2)) (= o((By, Be), By = LE, 
where £ is defined according to (a)). 


£ + Jo = R? implies f + J = R?. 


In addition to the definitions of [A 1.5] and[A 1.6], let p —qsin<m, 
DGG ny My € Ie aan) Oo Nor Few et td Reon 


My = (Me € Mex m | ALM | Me]) = £, ALM, ; May) — VW} 


and for L*+ ¢ M4, L€ Mip-yxqr U € Me ms ™m =n —(p —7), 
Vee MU nea lee 
Meo1 uy >= {M, EM em |S Mm, € ean: M, € Worx (pay: 

‘MM, = MU + MV, DM, = 0,5,, 0°", =f: 
Let £ = F;(f+ + J£*) be a representation of £ according [A 1.6a)]. 
Then. J} = W824, holds if ADH)=f*"', L= —FL*, 
V = My, RU’) = W \ A(M)}) is satisfied (here ‘\’ denotes orthogo- 
nal difference). . 


For X € Mux, A € Mixes WE Mz, let C) & Mz be defined by 


Then the matrix X’C,XW is a projection matrix into the space 
XN(A) with respect to the norm |lz\|y = (2/Wz)"?, and B = O,X'Wy 
is a solution of 


lly — XA|%, = min ly — Xp\R,. 
BENW(A) 


(Proof as in Rae, 1972, p. 190). 




A19. Let XEM x. 4€ Mixes We M>, and r (=) = k be true. Then 


X[X'WX + AA}! X’W isa projection matrix into the space XV(A) 
concerning the norm ||-||y and 


A[X'WS + A’A}? X' = 0, X'WX[X'WX 4+ A'A}} X'WK = X'WX 
holds. 
Furthermore, 8 = [X’WX + A’A]}! X'W y is a solution of 


lly — XB lly = min |ly — XBliiy 
BEN (A) 
(Proof: Consequence of Bunke and Bunke, (1986, [A 1.29]).) 
A 1.10. Tiny € Moret x ket is defined by 
ATi, Ae A € Mest 


Let Lin —— Lit}: For A E Mr sets B € eh obseop it holds that (A & B) dint 
= Tme(B ® A), Lot = Le) = Les Ten ig = Ter (MacRae, 1974; 
Magnus ae Neudecker, 1979). 


Level = tl 2 + Ijp). Lp is a projection matrix onto the linear 
space {A | 7 €M,} in R?*, r[,] = p(p + 1)/2. Let F, € iene 
Ee, hee 1) Oe ie dis = Inipsiyja- For A = (ei Wes me Aiea (ay, 0: 
yp, Ag9,..-, Ap, ---, Az). Then there isa C € Dee Cuil x (kde-+1)/2) Such sha 


A= CA, Ae Mg. 
A2 Asymptotics


A 2.1. For $n = 0, 1, 2, \ldots$ let random vectors $Z_n$ with values in
$\mathbb{R}^k$ be given. If, for the sequence of the distribution functions
$F_n(z) = P\{Z_n < z\}$ of $Z_n$, $F_n(z) \to F_0(z)$ $(n \to \infty)$ is true
for all continuity points of the limiting distribution function $F_0$, then
the sequence of distributions of $Z_n$
A 2.2.


A 2.4. 


A 2.5. 




is called weakly convergent towards the distribution of $Z_0$ (notation:
$\mathcal{L}(Z_n) \to \mathcal{L}(Z_0)$ as $n \to \infty$).


We consider two sequences $\{Q_n\}$ and $\{P_n\}$ of probability distributions
$Q_n$ and $P_n$ over measurable spaces $(\mathfrak{X}_n, \mathfrak{B}_n)$. The
sequence $\{Q_n\}$ is said to be contiguous to the sequence $\{P_n\}$ in case
$P_n(B_n) \to 0$ with $B_n \in \mathfrak{B}_n$ implies $Q_n(B_n) \to 0$. If
$Q_n$ and $P_n$ have the densities $q_n(x)$ and $p_n(x)$ with respect to a
σ-finite measure $\mu_n$ over $(\mathfrak{X}_n, \mathfrak{B}_n)$, we use the
same terminology for the sequences of the corresponding densities
(Hájek and Šidák, 1967).


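(1st Lemma by Le Cam)
If $Q_n$ and $P_n$ have the densities $q_n(x)$ and $p_n(x)$ with respect to a
σ-finite measure $\mu_n$ over $(\mathfrak{X}_n, \mathfrak{B}_n)$, and if with

$$L_n(x) = \begin{cases} q_n(x) / p_n(x) & \text{if } p_n(x) > 0, \\ 1 & \text{if } p_n(x) = q_n(x) = 0, \\ +\infty & \text{if } p_n(x) = 0 < q_n(x) \end{cases}$$

and $X_n \sim P_n$, for a positive constant $b^2$

$$\mathcal{L}(\ln L_n(X_n)) \to N(-\tfrac{1}{2} b^2, b^2) \qquad (n \to \infty)$$

is true, then the sequence $\{Q_n\}$ is contiguous to the sequence $\{P_n\}$
(Hájek and Šidák, 1967).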
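(3rd Lemma by Le Cam)


Let $\{Q_n\}$ be contiguous to $\{P_n\}$ and, with the notation of [A 2.3],
let, for a sequence of statistics $S_n$, the random vector
$(S_n, \ln L_n)'$ under $P_n$ asymptotically have the distribution

$$N_2 \left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} \right), \qquad \mu_2 = -\tfrac{1}{2} \sigma_2^2.$$

Then, under $Q_n$, $S_n$ is asymptotically $N(\mu_1 + \sigma_{12}, \sigma_1^2)$
distributed (Hájek and Šidák, 1967).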


For a density f with finite Fisher-information J(f) we consider the 
densities 


mies Tay ca) and ae po ) = Te 


for « = (a5, sisis's a) with d. =— ae and max (dn, haa d,)? 0) 
as well as N i=1 1Si<n a 


LG)S NG ay an) ee ee ee 0 =O? <0 


A 2.6. 


Lae & 




Furthermore, let , 


(n) 
Qa (x) ‘ 
a if ph” (ar SKIP 
s pi” (x) a) 
Lgl) = “emi —/ ln) 
1 if py (a) = ad (~) = 0, 
oo Tg a) 0 <e ge) 


for x € IR” denote the likelihood ratio and let T” denote the statistics 


% See 
Ty? = —D (dn, — dn) — In f(x) 
i=1 dx 


2=2,-d 
Then 


Pp) 


In L) — +>u Fy 0 


and 


f(a Ll) — > NV es 6, v) 


are true and thus, according to [A 2.3], {gq} is contiguous to {p\”} 
(Hajek and Sidak, 1967). 


Let $\{P_n\}$ and $\{Q_n\}$ be two sequences of probability distributions
$P_n$, $Q_n$ over the measurable spaces $(\mathfrak{X}_n, \mathfrak{B}_n)$ with
the densities $p_n$ and $q_n$ with respect to a σ-finite measure $\mu_n$ over
$(\mathfrak{X}_n, \mathfrak{B}_n)$. The sequence $\{q_n\}$ is assumed to be
contiguous to the sequence $\{p_n\}$. Then to every $\varepsilon > 0$ there is
a $\delta > 0$ such that for every sequence $\{B_n\}$ with
$B_n \in \mathfrak{B}_n$, $Q_n(B_n) < \varepsilon$ holds for almost all $n$
whenever $P_n(B_n) < \delta$ is satisfied for almost all $n$
(Jurečková, 1969).


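For $n = 1, 2, \ldots$ let the measure spaces
$(\mathfrak{X}_n, \mathfrak{B}_n, \mu_n)$ be given with σ-finite measures
$\mu_n$. For an open subset $\Theta$ of $\mathbb{R}^k$ let
$\mathcal{P}_n = \{P_{n\vartheta} \mid \vartheta \in \Theta\}$ be a family of
probability distributions over $(\mathfrak{X}_n, \mathfrak{B}_n)$ and let
$\mathcal{P}_n$ be dominated by $\mu_n$. By
$p_{n\vartheta} = dP_{n\vartheta} / d\mu_n$ we denote a variant of the
Radon-Nikodym density with respect to $\mu_n$. For a sequence of matrices
$S_n \in \mathfrak{M}_k$ with $\|S_n\| \to 0$, for
$\vartheta_n = \vartheta + S_n h$ with a vector $h \in \mathbb{R}^k$ and
$X_n \sim P_{n\vartheta}$ the likelihood ratio

$$L_n(h, \vartheta, X_n) = \begin{cases} p_{n\vartheta_n}(X_n) / p_{n\vartheta}(X_n) & \text{if } p_{n\vartheta}(X_n) > 0, \\ 0 & \text{if } p_{n\vartheta}(X_n) = 0 \end{cases}$$

is uniquely defined for almost all $n$.
If, for all $h \in \mathbb{R}^k$ and $\vartheta \in \Theta$ there are a
positive definite matrix $I(\vartheta)$ and sequences of functions
$\Delta_n(\vartheta)$ and $\psi_n(h, \vartheta)$ with

$$\Delta_n(\vartheta) : (\mathfrak{X}_n, \mathfrak{B}_n) \to (\mathbb{R}^k, \mathfrak{B}^k), \qquad \psi_n(h, \vartheta) : (\mathfrak{X}_n, \mathfrak{B}_n) \to (\bar{\mathbb{R}}^1, \bar{\mathfrak{B}}^1)$$

such that

$$L_n(h, \vartheta, X_n) = \exp \left\{ h' I^{1/2}(\vartheta) \Delta_n(\vartheta)(X_n) - \tfrac{1}{2} h' I(\vartheta) h + \psi_n(h, \vartheta)(X_n) \right\}$$

with

$$\mathcal{L}\{\Delta_n(\vartheta) \mid \vartheta\} \to N_k(0_k, I_k) \qquad (n \to \infty)$$

and

$$\psi_n(h, \vartheta) \xrightarrow{P} 0 \qquad \text{for all } \vartheta \in \Theta, \; h \in \mathbb{R}^k,$$

then the sequence $\{\mathcal{P}_n\}$ is called locally asymptotically normal
(e.g. Ibragimov and Khasminski, 1981). Here $\bar{\mathbb{R}}^1$ denotes the
real line extended by $\{-\infty, \infty\}$, and $\bar{\mathfrak{B}}^1$ the
corresponding Borel σ-algebra.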

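A 2.8. With the notation of [A 2.7] let $\{\mathcal{P}_n\}$ be locally
asymptotically normal, and let the densities $p_{n\vartheta}$ be continuous in
$\vartheta$. Then there is a Lebesgue zero set $N \subset \Theta$ such that
for any sequence of estimators $\hat\vartheta_n = \hat\vartheta_n(X_n)$ with
$\mathcal{L}\{S_n^{-1}(\hat\vartheta_n - \vartheta) \mid \vartheta\} \to N_k(0_k, V(\vartheta))$
the inequality $V(\vartheta) \ge I^{-1}(\vartheta)$ is fulfilled for
$\vartheta \in \Theta \setminus N$.

(Bahadur, 1967; Roussas, 1972; Ibragimov and Khasminski, 1981).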

A 2.9. Let (X,%, ) be a measure space and {P, | #€ O}, O—R* be a para- 


metric family of probability distributions on (2, 8) dominated by wu. 
Let (Zn, Bn, Png) denote the n-fold product of (%, 6, Ps) with itself, 
and {P,,} be the corresponding sequence of distribution families. Let 


dP = 
© be an open subset of R*, f(-, 3) = ia (-), and © the closure of 9 
Mu 


in the one-point compactification of IR*. Let f(x, ) be defined for 
# € O and continuous in # on @ (for u - almost all x €\%). Let &, denote 
a maximum likelihood estimator of #, based on observations x, ..., Xp, 
ie. d,, is a measurable solution of 


I (xi, 8 lay 335 0) = sup iI f(aj, 0). 
6B i=l 
(a) Under the conditions: 
1. For every y > 0 and % € @ it holds that 
inf sf (Pa, 8) — fw, 5)? du > 0; 


9€0:|9—d|>y & 
A 2.10.


(b 


~~ 


(c) 


2. For all 8 € O 


sup (f(x, 8) — fu?(x, + h))? du => 0; 


X h+9eB,|nli<s 


and 


3. If O is unbounded, then for all 8 ¢€ O 
lim f sup (f2(a, 8) p2(a, 8+ h)) du <1 


dco L 9+heb,||h||>6 


d, is a consistent estimator for 0, (Ibragimov and Khasminski, 
1981). 


The conditions 


4. f(x, 3) is twice continuously differentiable in 0 


f(x, 9) 
5. [eet #) du => OF ks eat du = On scks OE 0, 
se Zi 


6. J(P) := Hy( In f(X, 8) (a5 In f(X, 9)’ exists and is positive 
definite for & € @. 


7. For all & € @ there is a function h(x) and a 6 > 0 with 
2 
up (: In f(x, 3) ) 
o=9 


5€0:||8—d||<6 08, 08; 
and Eyh(X) < oo 

with I(9) := J(d), S, = n-¥2I, ensure local asymptotic normality 

of the sequence {P,} (Bahadur, 1967; Ibragimov and Khasminski, 

1981). 


| < h(x) 


n—>oCo 


efficient in the sense of [A 2.8]) in case the conditions 1, ..., 7 are 
satisfied (Witting and Nolle, 1970). 


£{n(d, — 3) ——+ N(0, J(9)*) holds (i.e. 4, is asymptotically 


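Let $f : \mathbb{R}^r \times \mathbb{R}^s \to \mathbb{R}^p$ be a function
which is continuously differentiable on $\mathbb{R}^r \times \mathbb{R}^s$,
$\{X_n\}$ be a sequence of $r$-dimensional random vectors, and $\{Z_{0n}\}$
with $Z_{0n} = [X_{0n} \mid Y_{0n}]$ a sequence of random vectors with values
in $\mathbb{R}^{r+s}$ with the properties

$$\mathcal{L}\{\sqrt{n}\,(X_n - X_{0n})\} \to N(0, A) \qquad (n \to \infty)$$

and

$$Z_{0n} \to Z_0 = [X_0 \mid Y_0] \qquad (n \to \infty).$$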
A 2.11.


A 2.12. 


Then, with $H = \partial_x f(X_0, Y_0)$ and $B = HAH'$,
$\mathcal{L}\{\sqrt{n}\,(f(X_n, Y_{0n}) - f(X_{0n}, Y_{0n}))\} \to N(0, B)$
$(n \to \infty)$ holds.


Assume given for m = mp, m + 1,...,a sequence of integers {n(m)}, 
a sequence of random matrices {Q,,} with values in It> and a sequence 
of nonrandom matrices An € Mnyim)xp be given. Furthermore, let 
n(m) 


—— « with OS «<1 and m144,,An Goer A EME be ful- 


m= m-—>co 


m 
filled. Let mQ,, be W,(n(m), Py An) distributed (for the definition of the 
noncentral Wishart distribution see Bunke and Bunke, 1986, [A 2.13]). 
Then it follows that 


{mos (2 AS OSULS 5 dk m4nAn)t 


m 
ar Np xp(0, 20(Z ® Z) Ip + 40,(A @ Z)T,). 


m—>oo 


(I, is defined in [A 1.10]). 


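Let $(\mathfrak{X}, \mathfrak{B})$ be a measurable space and
$\mathcal{P} = \{P_\vartheta \mid \vartheta \in \Theta\}$,
$\Theta \subset \mathbb{R}^k$, be a parametric distribution family on
$(\mathfrak{X}, \mathfrak{B})$. Let $\Theta$ be an open subset of
$\mathbb{R}^k$ and denote by $\bar\Theta$ the closure of $\Theta$ in the
one-point compactification of $\mathbb{R}^k$. Let $\bar{\mathbb{R}}^1$ denote
the real line extended by $\{-\infty, \infty\}$, and $\bar{\mathfrak{B}}^1$
the corresponding σ-algebra of Borel sets. A family of functions
$g(\cdot, \tau) : (\mathfrak{X}, \mathfrak{B}) \to (\bar{\mathbb{R}}^1, \bar{\mathfrak{B}}^1)$
for $\tau \in \bar\Theta$ is called a family of contrast functions for
$\mathcal{P}$ if $E_\vartheta g(X, \tau)$ ($X \sim P_\vartheta$) exists for all
$\vartheta \in \Theta$, $\tau \in \bar\Theta$ and if

$$E_\vartheta g(X, \vartheta) < E_\vartheta g(X, \tau) \qquad \forall \vartheta \in \Theta, \; \tau \in \bar\Theta, \; \vartheta \ne \tau.$$

Let $X_i$, $i = 1, \ldots, n$, be independent random variables with
distribution $P_\vartheta$, $\vartheta \in \Theta$, and let
$(\mathfrak{X}_n, \mathfrak{B}_n)$ denote the $n$-fold product of
$(\mathfrak{X}, \mathfrak{B})$ with itself. A minimum contrast estimator
$\hat\vartheta_n^g$ for $\vartheta$, based on the observations
$X_1, \ldots, X_n$, is a $(\mathfrak{X}_n, \mathfrak{B}_n)$-measurable solution
of

$$\sum_{i=1}^{n} g(X_i, \hat\vartheta_n^g) = \inf_{\vartheta \in \Theta} \sum_{i=1}^{n} g(X_i, \vartheta).$$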

Let F denote a set of families of contrast functions. Under certain 
regularity conditions on F and P (see Pfanzagl, 1969, Michel and 
Pfanzagl, 1971; Pfanzagl, 1973) it holds that: 


(a) for g € F there exists almost surely a minimum contrast esti- 
mator 3% for almost all n, and 


£{ni!2(92 — 8)} <> N;,(0,, V9(9)), 


n—co 


V8) = (V3(8))-? V9) (V5(8))-2, 


A 2.13. 


V4(0) := Eo(d0g(X, 9)’ Aog(X, 9), 


g s SOE e 
V3(0) := —E (( 20, 50, g(X, »)) : 


(b) Let P be dominated by a o-finite measure mw. There exists a 
family of contrast functions f)(-, 7), t € © and a version of the 


dP 
u-density f(-, 3) = ae &€ 0, such that fo(-, 8) = —In f(-, 8), 
lt 


8 € O. If $4 exists, this estimator is a maximum likelihood esti- 
mator. 


(c) If fo(-, -) € F, then it holds that 
Vio) 2 VEO) (VEO), 86 @, ig.eF. 


We consider the linear rank statistic

$$S_n = \sum_{i=1}^{n} (c_{ni} - \bar{c}_n) \, a_n(R_{ni}),$$

where $R_{n1}, \ldots, R_{nn}$ denote the ranks of the independent random
variables $X_{n1}, \ldots, X_{nn}$. For a function $\varphi$, which is square
integrable on the interval $[0, 1]$, let the numbers $a_n(i)$ fulfil the
relation

$$\int_0^1 \left( a_n(1 + [tn]) - \varphi(t) \right)^2 dt \to 0 \qquad (n \to \infty)$$


and the constants ¢,), ..., Cnn fulfil the condition 


n 
1Sisn = 
Ce Ce 
=1 


For a density fy with finite Fisher information J(f,) let the random 
vector (X,,..., X,)’ have the density 


Gna = [1 foe: —4,,), m= 1,2,..: 
4=1 
and let 


TE NDC see On tegen 8 0 209 00: 
A 


l 


Then S, is asymptotically N(q,, 05) distributed with 


n 1 


Mae = 2g; (Cn, — ¢,] (dn, oe dn) | p(t) vlé, fo) dé, 


i=1 0 
A 2.15.




o2 = D (en, — bn)® f (ptt) — #)? at 


d 
pt, f) me Sade: In f(2)|z=7-W> Oa and 


1 . 
@ = | glt) dt, (Hajek and Sidak, 1967). 
0 

Let $X_{n1}, \ldots, X_{nn}$ be independently and identically distributed
random variables with the absolutely continuous distribution function $F$. In
[A 2.13] let especially $a_n(i) = E\varphi(U_n^{(i)})$, where $U_n^{(i)}$
denotes the $i$-th order statistic of a sample of size $n$ with respect to the
uniform distribution $R[0, 1]$ (cf. Bunke and Bunke, 1986, [A 2.35]). For the
square integrable function $\varphi$ from [A 2.13] let


5 (p(t) — 9) dt > 0 


0 


be true. Then, the condition 


aoa et 
(Cn, Cn) ) (9) 


implies 

E(S, — T,)? 

AS, — Ts 9 (Hajek and Sidak, 1967) 
% (Cn, — €n)? 

t=1 


Let Xy,,...,Xyy be a sample to the Si fo with the finite Fisher 
information I(fy) and let XY) < X®) <...< XW denote the ordered 


sample. For a sequence {ey} of positive Hema bers €y With ¢, ———> 0 and 


m—>oo 
N1/4¢2, ———+ 00 we put my = [N®/4e,?] and ny = [N1/4c3] as well as 


qN 
= 


= 
Myny hy,+my hy, —my) 1— 
ie {|x yt mw) x 4 | 1 


n— 


b= | for 15) S ny and 


N+ Hl 
— [xGreatee) Se adherent 


N 
h 
if Au st < Aha, j= 13k. 


O otherwise. 


A 2.16. 


Tea 


A 2.18. 


A 2.19. 


A 2.20. 


(Here [gq] denotes for each g € R! the greatest integer that does not 
exceed q) 
G y(t) is ‘a consistent estimator for 


4) 
v(t, fo) = mee In fo(2)|,— ry (0<t<1) 


in the sense that the integral 


j [Bv(t) — lt, fo)? dt 
converges to zero in probability if we have the density 


n 
=[[folw:) (Hajek and Sidak, 1967). 
i=1 
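For $n = 1, 2, \ldots$ we consider a sequence of tests $\varphi_n$ for the
testing problem

$$H : \vartheta \in \Theta_H \quad \text{with } \Theta_H \subset \Theta.$$

$\varphi_n$ is called an asymptotic $\alpha$-test if the condition
$\limsup_{n \to \infty} E_\vartheta \varphi_n \le \alpha$ holds for all
$\vartheta \in \Theta_H$.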


(Lemma of Fatou) If $\{f_n\}$ is a sequence of nonnegative $\mu$-integrable
functions with $\liminf_{n \to \infty} \int f_n \, d\mu < \infty$, then the
function $f(x) = \liminf_{n \to \infty} f_n(x)$ is $\mu$-integrable and

$$\int f \, d\mu \le \liminf_{n \to \infty} \int f_n \, d\mu \qquad \text{(Halmos, 1950)}.$$


(Lebesgue theorem) Let $\{f_n\}$ be a sequence of $\mu$-integrable functions
$f_n$ with $f_n \to f$ and let $g$ be a $\mu$-integrable function with the
property $|f_n(x)| \le g(x)$ $\mu$-almost everywhere, $n = 1, 2, \ldots$ Then
$f$, too, is $\mu$-integrable and

$$\int f_n \, d\mu \to \int f \, d\mu \quad (n \to \infty) \qquad \text{(Halmos, 1950)}.$$


Let $\mathfrak{X}$ be a Euclidean space and $\Theta$ a compact subset of a
Euclidean space. If $g$ is a bounded and continuous function on
$\mathfrak{X} \times \Theta$ and if $F_n(x) \to F(x)$ $(n \to \infty)$ is
valid for all $x \in \mathfrak{X}$ for distribution functions
$F_1, F_2, \ldots$ and $F$ on $\mathfrak{X}$, then the integral
$\int g(x, \vartheta) \, dF_n(x)$ tends uniformly in $\vartheta \in \Theta$ to
the limit $\int g(x, \vartheta) \, dF(x)$ (Jennrich, 1969).


(Generalized lemma by Chow) Let $\{X_i\}$ be a sequence of independently
and identically distributed random vectors with values in $\mathbb{R}^k$ and
$EX_i = 0$, $E\|X_i\|^2 < \infty$. For an arbitrary array of nonrandom
matrices $G_{ni}$ ($i = 1, \ldots, k_n$; $n = 1, 2, \ldots$) we then have


kn 
y Gn,Xi 
tet 8S (hows 1066). 
kn 
mY Gall 
A3 Addenda


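A 3.1. Let $g$ be a real-valued function on $\mathfrak{X} \times \Theta$,
where $\Theta$ is a compact subset of a Euclidean space and $\mathfrak{X}$ a
measurable space. For each $\vartheta$ from $\Theta$ let $g(x, \vartheta)$ be
a measurable function in $x$ and for each fixed $x \in \mathfrak{X}$ let it be
continuous in $\vartheta \in \Theta$. Then there exists a measurable mapping
$\hat\vartheta : \mathfrak{X} \to \Theta$ with

$$g(x, \hat\vartheta(x)) = \sup_{\vartheta \in \Theta} g(x, \vartheta) \qquad \text{(Jennrich, 1969)}.$$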


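A 3.2. A sequence $\{Y_n\}$ of random variables is said to be a martingale
with respect to a sequence $\{\mathfrak{B}_n\}$ of σ-algebras
$\mathfrak{B}_n$ with $\mathfrak{B}_n \subset \mathfrak{B}_{n+1}$ if

$$E(Y_{n+1} \mid \mathfrak{B}_n) = Y_n, \qquad n = 1, 2, \ldots$$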
A 3.3.


A 3.4. 


A 3.5. 


A 3.6. 


A 3.7. 


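Let $\{X_n\}$ be a sequence of random variables and $\{\mathfrak{B}_n\}$ be a
nondecreasing sequence of σ-algebras with
$E(X_n \mid \mathfrak{B}_{n-1}) = 0$, $n = 1, 2, \ldots$
($\mathfrak{B}_0 = \{\emptyset, \Omega\}$). Then
$\{Y_n = \sum_{i=1}^{n} X_i\}$ is a martingale with respect to
$\{\mathfrak{B}_n\}$ and for $n = 1, 2, \ldots$ we have the extended
Kolmogorov inequality

$$P\left\{ \max_{1 \le k \le n} |Y_k| \ge \varepsilon \right\} \le \frac{1}{\varepsilon^2} \sum_{i=1}^{n} D\{X_i\} \qquad \text{for each } \varepsilon > 0 \quad \text{(Loève, 1955)}.$$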
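Let $\mathcal{P}_0$ and $\mathcal{P}_1$ be two disjoint families of
probability distributions that are absolutely continuous with respect to a
σ-finite measure. If there exists a pair
$(Q_0, Q_1) \in \mathcal{P}_0 \times \mathcal{P}_1$ such that for the
uniformly most powerful $\alpha$-test $\varphi^*$ for $Q_0$ against $Q_1$,

$$E_{Q_0} \varphi^* = \sup_{Q \in \mathcal{P}_0} E_Q \varphi^*$$

and

$$E_{Q_1} \varphi^* = \inf_{Q \in \mathcal{P}_1} E_Q \varphi^*$$

hold, then $\varphi^*$ maximizes $\inf_{Q \in \mathcal{P}_1} E_Q \varphi$ in
the class of all $\alpha$-tests $\varphi$ for

$$H_0 : \mathcal{P}_0 \quad \text{against} \quad H_1 : \mathcal{P}_1$$

and $\varphi^*$ is the only test with that property if the uniformly most
powerful $\alpha$-test for $Q_0$ against $Q_1$ is unique (Lehmann, 1959).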


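Let a family $\mathcal{P}$ of probability distributions $P$ be given with the
distribution functions $F$. Let $X_1, \ldots, X_n$ be a sample to the
distribution function $F(x - \vartheta)$ with $\vartheta \in \mathbb{R}^1$. A
set $D^*$ of estimators for $\vartheta$ is said to be essentially complete in
the generalized minimax sense if there exists for each estimator
$\hat\vartheta$ an estimator $\hat\vartheta^* \in D^*$ with

$$\sup_{P \in \mathcal{P}, \, \vartheta \in \mathbb{R}^1} R(\vartheta, \hat\vartheta^*) \le \sup_{P \in \mathcal{P}, \, \vartheta \in \mathbb{R}^1} R(\vartheta, \hat\vartheta).$$

Here $R(\vartheta, \hat\vartheta)$ denotes the risk function (Zacks, 1971).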


(Generalized theorem by Hunt and Stein) 

With the notation of [A 3.5] let the set $D$ of all estimators
$\hat\vartheta$ for $\vartheta$ be a separable metric space with respect to a
suitably chosen topology. Let the loss function $L(\vartheta, d)$ be
nonnegative and of such a kind that the set $\{d \mid L(\vartheta, d) \le t\}$
is compact for each $t \in \mathbb{R}^1$. Then the set $D^*$ of the
equivariant estimators for $\vartheta$ is essentially complete (Zacks, 1971).


Let $\{i\} = (i_1, \ldots, i_n)$ and $\{j\} = (j_1, \ldots, j_n)$ be
permutations of $(1, \ldots, n)$. We say that $\{i\}$ is better ordered than
$\{j\}$ if for all $a$ and $b$ with $a < b$ and $j_a < j_b$ also $i_a < i_b$
follows.
A 3.8.


A 3.9. 


A 3.10. 


A 3.11. 
In case $\{i\}$ is better ordered than $\{j\}$, then it holds for each
$m \le n$ for the ordered $m$-tuples $(i_{1m}, \ldots, i_{mm})$ and
$(j_{1m}, \ldots, j_{mm})$ of $(i_1, \ldots, i_m)$ and $(j_1, \ldots, j_m)$,
respectively, that


tintadim torall Laks m andall Lams n. 


If $\{i\}$ is better ordered than $\{j\}$ and if $a_1 < \ldots < a_n$ denote
real numbers, and $h$ denotes a nondecreasing function, then

$$\sum_{k=1}^{n} a_k h(i_k) \ge \sum_{k=1}^{n} a_k h(j_k) \qquad \text{(Lehmann, 1966)}.$$


Let F be a distribution function and @ the set of continuously diffe- 
rentiable functions y on a compact set € with | p(x) dF(x) > 0. Then 
the functional 


2 
il 
6s J Vet slice Weeescre) oe ale 
ve —f p(x) dF 
is convex in F’ (Huber, 1969). 
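Let the random vector $X = (X_1, \ldots, X_n)'$ have the density
$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i)$. Then the vector of the ranks
$R = (R_1, \ldots, R_n)'$ of $X$ and the vector
$X^{(\cdot)} = (X^{(1)}, \ldots, X^{(n)})'$ of the order statistics have the
following distributions:

$$P\{R = r\} = \frac{1}{n!} \qquad \text{for } r \in \mathcal{R},$$

where $\mathcal{R}$ denotes the set of all permutations of the numbers
$(1, \ldots, n)$, and $X^{(\cdot)}$ has the density

$$q(x) = \begin{cases} n! \prod_{i=1}^{n} f(x_i) & \text{for } x_1 \le \ldots \le x_n, \\ 0 & \text{otherwise} \end{cases}$$

(Hájek and Šidák, 1967).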


As in [A 3.10], let $X$ have the density $p(x) = \prod_{i=1}^{n} f(x_i)$. Let
the marginal density $f$ be symmetric about zero. Then the statistics
$\operatorname{sign} X$, $R^+$, and $|X|^{(\cdot)}$ are stochastically
independent and we have

1. $P\{\operatorname{sign} X = v\} = 2^{-n}$ for $v \in \mathcal{V}$

($\mathcal{V}$ is the set of all $n$-vectors the components of which are
either $+1$ or $-1$);


A 3.12. 


A 3.13. 


A 3.14. 
2. $P\{R^+ = r\} = \dfrac{1}{n!}$ for $r \in \mathcal{R}$ (cf. [A 3.10]);


3. $|X|^{(\cdot)}$ has the density

$$g(x) = \begin{cases} 2^n \, n! \prod_{i=1}^{n} f(x_i) & \text{for } 0 \le x_1 \le \ldots \le x_n, \\ 0 & \text{otherwise} \end{cases}$$

(Hájek and Šidák, 1967).

With the numbers $c_1, \ldots, c_n$ and $a(1), \ldots, a(n)$ we consider the
statistic $S = \sum_{i=1}^{n} c_i \, a(R_i)$. Here $R = (R_1, \ldots, R_n)'$ is
a random vector with the uniform distribution over the space $\mathcal{R}$ of
all permutations of $(1, \ldots, n)$. Then it follows that

$$ES = \frac{1}{n} \sum_{i=1}^{n} c_i \sum_{j=1}^{n} a(j)$$

and

$$D\{S\} = \frac{1}{n-1} \sum_{i=1}^{n} (c_i - \bar{c})^2 \sum_{j=1}^{n} (a(j) - \bar{a})^2$$

with

$$\bar{c} = \frac{1}{n} \sum_{j=1}^{n} c_j \quad \text{and} \quad \bar{a} = \frac{1}{n} \sum_{i=1}^{n} a(i) \qquad \text{(Hájek, 1969)}.$$


Let f(z) be an one-dimensional density and {f(~— #)| 0 <¢ IR} the 
related family of densities with the location parameter # € IR!. We 
consider the class J of all densities f with the properties: 


1. f(x) is continuously differentiable in x; 
2. f @f(x) dx < oo; and 
Sane heer 0 


|z|—>oo 


1 : 
Among all densities f € F the density f*(~) =—= e-(”)*" has the 
smallest Fisher information 2x 


2 
LO) = Ve In f(x — ») f(a — 8) dx 
with constant variance o?. 


Let X = (X,,...,X,) be a random matrix with values in N,,.,, where 
the common distribution of the p-dimensional random vectors Xj, ..., 
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X,, is absolutely continuous with respect to the np-dimensional Lebes- 
gue measure. Let A be a real symmetric (n X)-matrix of rank r. 
Then, for the random matrix S := XAX’ the following statement 
holds almost surely: we have 7[S] = min (p, 7) and the nonvanishing 
eigenvalues of S are different (Okamoto, 1973). 
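
The moment formulas stated above for the linear rank statistic $S = \sum_i c_i a(R_i)$ with uniformly distributed rank vector $R$ can be checked exactly for small $n$ by enumerating all permutations. The following Python sketch is only an illustration: the function names and the example numbers are assumptions made here and are not taken from Hájek (1969).

```python
from fractions import Fraction
from itertools import permutations


def exact_moments(c, a):
    """Exact mean and variance of S = sum_i c[i] * a(R_i),
    with R uniform on all permutations (ranks are 0-based here)."""
    n = len(c)
    values = [
        sum(Fraction(c[i]) * a[r[i]] for i in range(n))
        for r in permutations(range(n))
    ]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def formula_moments(c, a):
    """Closed-form mean and variance as stated in the text."""
    n = len(c)
    c_bar = Fraction(sum(c), n)
    a_bar = Fraction(sum(a), n)
    mean = Fraction(sum(c) * sum(a), n)
    var = (
        Fraction(1, n - 1)
        * sum((ci - c_bar) ** 2 for ci in c)
        * sum((aj - a_bar) ** 2 for aj in a)
    )
    return mean, var


# Hypothetical regression constants c_i and scores a(j), chosen for illustration.
c = [1, 4, 2, 7]
a = [3, 1, 5, 2]
print(exact_moments(c, a) == formula_moments(c, a))
```

Exact rational arithmetic is used so that the comparison does not depend on floating-point rounding; the enumeration grows like $n!$ and is meant for small $n$ only.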


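A 3.15. Let $A \in \mathcal{M}_p^{>}$, $a > 0$ and

$$l(\Sigma) := a \log \det[\Sigma] + \operatorname{tr}[A \Sigma^{-1}].$$

Then

$$\hat{\Sigma} = A/a = \arg\min_{\Sigma \in \mathcal{M}_p^{>}} l(\Sigma)$$

(cf. Rao, 1973, 8.a.5.8–10).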
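
The minimizing property of $A/a$ can be illustrated numerically. The sketch below is restricted to symmetric $2 \times 2$ matrices so that determinant and inverse can be written out by hand. The matrix $A$, the constant $a$, the helper names and the random trial matrices are all assumptions chosen for illustration. It compares $l(A/a)$ with $l(\Sigma)$ for randomly generated positive definite $\Sigma$.

```python
import math
import random


def objective(sigma, A, a):
    """l(Sigma) = a * log det(Sigma) + tr(A Sigma^{-1}) for symmetric 2x2
    matrices given as ((s11, s12), (s12, s22))."""
    (s11, s12), (_, s22) = sigma
    (a11, a12), (_, a22) = A
    det = s11 * s22 - s12 * s12
    if s11 <= 0 or det <= 0:
        raise ValueError("Sigma must be positive definite")
    # Sigma^{-1} = (1/det) * [[s22, -s12], [-s12, s11]]
    trace = (a11 * s22 - 2.0 * a12 * s12 + a22 * s11) / det
    return a * math.log(det) + trace


def random_pd(rng, eps=1e-3):
    """Random positive definite 2x2 matrix of the form B B' + eps * I."""
    b11, b12, b21, b22 = (rng.uniform(-2.0, 2.0) for _ in range(4))
    off = b11 * b21 + b12 * b22
    return ((b11 * b11 + b12 * b12 + eps, off),
            (off, b21 * b21 + b22 * b22 + eps))


A = ((4.0, 1.0), (1.0, 3.0))   # hypothetical positive definite matrix
a = 5.0                        # hypothetical constant a > 0
sigma_hat = tuple(tuple(v / a for v in row) for row in A)

best = objective(sigma_hat, A, a)
rng = random.Random(0)
worst_gap = min(objective(random_pd(rng), A, a) - best for _ in range(1000))
print(worst_gap >= -1e-9)
```

A random search of this kind cannot prove the statement; it only shows that none of the trial matrices gives a smaller value than $A/a$.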


A 3.16. Let $\mathfrak{L}_{k,l}$ (for the definition of $\mathfrak{L}_{k,l}$, see [A 1.3]) be endowed in a canonical
way with the structure of a compact differentiable manifold
(Grassmann manifold; see Dieudonné, 1976, ch. 16). Then:


(a) For 1<k the mapping @: Mu_yxi > eu O(R) = ALL B)) 
is a homeomorphism of IM,_;),,; onto the open subset e(My_1) x7) 
of Qe 


(b) For a sequence of random variables {£m}mexy With values in 
eid & ns Lo, £y € &, holds if and only if there exists a se- 
quence of random variables {Lm}mey With values in Mi, and 
Do € Mi, such that AR(Ly) = Lm, MEN, A(Ly) = fo, and 
Lyn => Lp. 

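(c) For $0 < l < k$ the mapping $\theta\colon \mathfrak{L}_{k,l} \to \mathfrak{L}_{k,k-l}$, $\theta(\mathfrak{L}) := \mathfrak{L}^{\perp}$ is a
homeomorphism.


(d) The mapping $\mu\colon \mathcal{M}_{k \times k}^{k} \times \mathfrak{L}_{k,l} \to \mathfrak{L}_{k,l}$, $\mu(A, \mathfrak{L}) := A\mathfrak{L}$ (where $A\mathfrak{L}$
denotes the image of $\mathfrak{L}$ under the linear mapping $A$) is continuous.


(e) There exists a uniquely determined measure $\alpha$ on the $\sigma$-algebra $\mathfrak{B}$
of the Borel sets of $\mathfrak{L}_{k,l}$ for which $\alpha(\mathfrak{L}_{k,l}) = 1$, $\alpha(B) = \alpha(OB)$ for
each $B \in \mathfrak{B}$ and each orthogonal matrix $O \in \mathcal{M}_{k \times k}$ (Haar measure;
see Dieudonné, 1975, ch. 14).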


(f) Let J be an arbitrary element of &,,,. Then the set {f € &) | £ 
+ J = R"} is open in &, (£ + J denotes the linear hull of the 
set £ uJ in R*). 


(g) For a given M € Mi, the set {f € &,| R(M) S F} is a connected 
subset of ,, ;. 
A 3.17.


A 3.18. 


A 3.19. 


In addition to the definitions of [A 1.5]—[A 1.7], let 
My = {2 € Mare |Z =((Zy))TBS. ((Zi))iz7 € MP, 


j=1,2 
FZ, S28 Me : ((Zy))e22 = 21+ 22, AT) — F,-HZ = N, 
My {= {M © Woon | R(M) S eis JVM re (Qin, Dr os 
Let a mapping 


fi: My} ME X Moxa X MZ 
£:L8y pg L+I=R? 


be defined as follows: if, for a random vector 2 and J € M%, 

% ~ Nn +p(On+p, ) holds, then let 

(a) SSH Say ee aes) 

for w= (1, Onxr) 20s © 2= (Op x0, In) Zs 

ee te, ep hw , De ND Hire ao ae | Pay 

The mapping f is injective and 

f(MZ) = MF x ME K (Le ME | R(L) = J}. 

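Let $x = (x_1, \ldots, x_k)'$ be a $k$-dimensional random vector with independent
components $x_i = \mu_i + \varepsilon_i$, $i = 1, \ldots, k$, $E\|x\|^4 < \infty$.

Let $\mu := (\mu_1, \ldots, \mu_k)' = Ex$, $\Sigma := \operatorname{Diag}[\sigma_1^2, \ldots, \sigma_k^2]$ with $\sigma_i^2 := E\varepsilon_i^2$,
$\varphi_i := E\varepsilon_i^3$, $\psi_i := E\varepsilon_i^4$, $i = 1, \ldots, k$, and $\varphi := (\varphi_1, \ldots, \varphi_k)'$.
Then for

$$A = ((a_{ij}))_{i,j=1,\ldots,k} \in \mathcal{M}_k$$

it holds that

$$D\, x'Ax = \sum_{i=1}^{k} a_{ii}^2 (\psi_i - 3\sigma_i^4) + 4\varphi' \operatorname{diag}[A]\, A\mu + 2 \operatorname{tr}[\Sigma A \Sigma A] + 4\mu' A \Sigma A \mu$$

($\operatorname{diag}[A] = \operatorname{Diag}[a_{11}, \ldots, a_{kk}]$).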


Let $x = \mu + \varepsilon$ be a $k$-dimensional random vector with $E\|x\|^4 < \infty$,
$Ex = \mu$. Let


W :— Hes’ © &é’, ® := He' & &&’, 2 Dee 
Then it holds that 
Daw’ = ¥ — SE’ + 2p’ © OG) LM, + 22,(u @ ©) + 40 (up’ @ 2) Le 
(1, as in [A 1.10]). If # is normally distributed, then
UF ES OS id) Pe oP = Oe ae 


A 3.20. A random variable $X$ is said to be double-exponentially distributed with
the parameters $\alpha$ and $\beta$ if it has the density


Oe seal 


with $-\infty < \alpha < +\infty$ and $0 < \beta < \infty$.
A random variable $X$ has a logistic distribution with the parameters
$\alpha$ and $\beta$ if its distribution function has the form

$$F_X(x) = (1 + \exp[-\beta x - \alpha])^{-1}$$

with $-\infty < \alpha < +\infty$ and $\beta > 0$.
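
The two distributions of [A 3.20] can be written down as short Python functions. This is a hedged sketch. The logistic distribution function follows the form given above. The double-exponential density is not legible in the present text, so the parametrization $(\beta/2) \exp[-\beta |x - \alpha|]$ used below is an assumption made here, as are the function names.

```python
import math


def logistic_cdf(x, alpha, beta):
    """Logistic distribution function (1 + exp[-beta*x - alpha])^(-1), beta > 0."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return 1.0 / (1.0 + math.exp(-beta * x - alpha))


def double_exponential_pdf(x, alpha, beta):
    """Double-exponential (Laplace) density.

    ASSUMPTION: parametrized as (beta / 2) * exp(-beta * |x - alpha|);
    the source formula is not legible, so this form is illustrative only.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    return 0.5 * beta * math.exp(-beta * abs(x - alpha))


def midpoint_integral(f, lower, upper, steps):
    """Plain midpoint rule, used only to check that a density integrates to 1."""
    h = (upper - lower) / steps
    return h * sum(f(lower + (i + 0.5) * h) for i in range(steps))


alpha, beta = 2.0, 3.0  # hypothetical parameter values
total_mass = midpoint_integral(
    lambda x: double_exponential_pdf(x, alpha, beta), -18.0, 22.0, 40000
)
median_value = logistic_cdf(-alpha / beta, alpha, beta)
print(total_mass, median_value)
```

With the logistic form above the median lies at $x = -\alpha/\beta$, which is what `median_value` evaluates.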
A4 Notation and terminology


A4.1 Abbreviations 


BAN-estimation sequence  best asymptotically normally distributed estimation sequence
BILUE                    best inhomogeneous linear unbiased estimator
BLUE                     best linear unbiased estimator
BUE                      best unbiased estimator
CIVE                     canonical instrumental variable estimator
EVM                      errors-in-variables model
GLSE                     generalized least squares estimator
IV                       instrumental variable
IVE                      instrumental variable estimator
LIFU                     linear functional relation
LIFU+                    LIFU with nonrandom unobservable variables
LIFU-                    LIFU with random unobservable variables
LIML                     MLE with limited information
LRT                      likelihood ratio test
LSE                      least squares estimator
MCE                      minimum contrast estimator
MLE                      maximum likelihood estimator
MLS                      solution of the likelihood equation
MSE                      mean square error
OLSE                     ordinary least squares estimator
ORLSE                    orthogonal least squares estimator
2SLS-estimator           two-stage least squares estimator
WILSA                    weighted inadequate least squares approximation
WILSE                    weighted inadequate least squares estimator
WLSE                     weighted least squares estimator


A4.2 Vectors, matrices, spaces 


((mi;)), ((mj) ryt matrix with elements m;; 


(m,,---, 7M) matrix with columns m,, ..., m, 


((11;;)), Ms, Mo. Bes matrix with submatrices M ;; 


$A \otimes B$   $= ((a_{ij} B))$ for $A = ((a_{ij}))$
(Wii we Ls ais M,), Min) = (Mi; ae M,,)' for matrices M; 
(¢ = 1,..., m) with the same number of columns 


(M,)i=*-" = (M,}...1M,) for matrices M,(¢=1,...,n) with the same 
number of rows 


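$r[M]$   rank of the matrix $M$
$\mathcal{M}_{n \times k}$   set of all $(n \times k)$-matrices with real elements
$\mathcal{M}_{n \times k}^{r}$   $:= \{M \in \mathcal{M}_{n \times k} \mid r[M] = r\}$
$\mathcal{M}_n$   set of symmetric matrices in $\mathcal{M}_{n \times n}$
$\mathcal{M}_n^{\ge}$   set of positive semidefinite matrices in $\mathcal{M}_n$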


$\mathcal{M}_n^{>}$   set of positive definite matrices in $\mathcal{M}_n$
$\mathbb{R}^n$   $n$-dimensional Euclidean space
$\mathbb{R}^{>}$, $\mathbb{R}^{\ge}$   set of positive and nonnegative real numbers, respectively
$\mathbb{N}$   set of all natural numbers
My set of all (x X p)-matrices with columns in a subset £ of IR” 
$\mathcal{R}(X)$   $= \{X\beta \mid \beta \in \mathbb{R}^k\}$ with $X \in \mathcal{M}_{n \times k}$


MX) = Marx) 
$\mathcal{N}(X)$   $:= \{\beta \in \mathbb{R}^k \mid X\beta = 0_n\}$ with $X \in \mathcal{M}_{n \times k}$


$\mathcal{L}(x_1, \ldots, x_k)$   subspace of $\mathbb{R}^n$ generated by the column vectors $x_i \in \mathbb{R}^n$ ($i = 1, \ldots, k$)
$M'$   transpose of the matrix $M$
M = (M,)io1.....0 if M = (M,)i-*-* with M, € Myr (6 = 1,..., 2) 
mM = (Minh, if M = (Miica,....n With Wy € Mr (@= 1... 2) 
$M^{-}$   generalized inverse of $M$
$M^{+}$   Moore generalized inverse of $M$
$\operatorname{tr}[M]$   trace of $M$
$\det[M]$   determinant of $M$
$\lambda_i[M]$   $i$th eigenvalue of $M$
$\lambda_{\max}[M]$   largest eigenvalue of $M$ ($= \lambda_1[M]$)
$\lambda_{\min}[M]$   smallest eigenvalue of $M$ ($= \lambda_n[M]$)


Diag (21;)i-3,...,n, Diag [,..., M,] block diagonal matrix with the diagonal 
matrices M; ‘ 


we, = M'AM with M € Myym and A € Mn 
|? = = ||, with A = £ 
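$P_{\mathfrak{L}}^{A}$   projection matrix (projector) onto the linear subspace $\mathfrak{L} \subseteq \mathbb{R}^n$ with
respect to the norm $\|\cdot\|_A$, i.e. $\|x - P_{\mathfrak{L}}^{A} x\|_A = \min_{y \in \mathfrak{L}} \|x - y\|_A$ ($x \in \mathbb{R}^n$)
$I_n$, $I$   identity matrix of order $n \times n$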


ioe = Pin (or, if no confusion is possible, = P#) 

Py =P R(M) 

£, \ £2 orthogonal difference between /, and /, 

Nie _ orthogonal complement of £ 

Lt matrix L with index ‘ortho’, mostly with A(L+) = (A(L))+ 
: (1k TCS RE 


07,0 null-vector in JR" 
$0_{n \times k}$, $0$   null-matrix in $\mathcal{M}_{n \times k}$
ee Ay" WALD COM C  oa' Fee Migrma Aie, Wee, 


Sxy = S¥y (or, if no confusion is possible, = 4) 
Sx Ny 

Qxy.z = Sxz87'Szy = XPzY' 

Qx.z = Orx.z 

Sxyzg = Sxy —Qxv.2 = XPisY' 

Sx.z = Sxx.z 

Qxy.uv = XPp,yY' = Sxy~SpSov.v 

J = Up-q 04x (p-a] 

o ie [O.n-2) xa! I) 

Lz = ERS B] 

L; sa ae oe a] 

L_,, &- set of r-dimensional subspaces of R? (p = r) 
Me, union of all Q_, withg<r 
| Qe set of the /-dimensional subspaces of IR* 

0 map from Wy. (pq) > ap—q With o(B) = R(Lz) 


A4.3 Sets and functions 


Sq = (LE Sypg | LE Moxcp_p: if A(L) = F, then rf J'L] = g} 
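$A \subseteq B$   $A$ is a subset of $B$
$A \subset B$   $A \subseteq B$, and there is an $a \in B$ with $a \notin A$
$A - B$   difference between sets
$A \times B$   $= \{(a, b) \mid a \in A,\ b \in B\}$
$A^n$   $= A \times A \times \cdots \times A$ ($n$-fold product of $A$)
$\mathfrak{B}^n$   class of Borel sets in $\mathbb{R}^n$
$\mathfrak{B}_A$   class of Borel subsets of the (Borel) set $A$
$A^{\mathrm{int}}$   set of all inner points of the set $A$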


(X, 2%) measurable space 

f: 2% —-Y fisamap from Z inY 

{”, f; | jth component of the vector-valued function f, in particular jth com- 
ponent of a vector 

Ass partial derivative of the function f with respect to the 7th component 

ilo—o of O at the point & = % 
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0" f 
(or 


Outs Onf(Hr)> matrix of partial derivatives of f with respect 
Om(1) filma) --- O,(p) fi(ér) to w at the point w= m (f: Rt > Ry, 


) matrix of second partial derivatives of the function f with 
Outs respect to the components of # at the point } = 8 


: Hy & Rp Sr ie = [oO oP Py) 
Ou(1) falta) --- Ou(p) fa(ur)/ / 


; 1 
k,(u) ae a lle— pl/>- for given 2 € M= 
{vj}, {vi}iexy Sequence of the 2; 
A = x; — v_,, where 2;, x;_, are elements of a sequence {xj} icq 
Ax; = x; — X, where x; is an element of a sequence {2} cen 


$\min_{z \in A} f(z)$, $\min\{f(z) \mid z \in A\}$   minimum of the function $f$ over $A$
$\max_{z \in A} f(z)$, $\max\{f(z) \mid z \in A\}$   maximum of the function $f$ over $A$
$\inf_{z \in A} f(z)$, $\inf\{f(z) \mid z \in A\}$   infimum of the function $f$ over $A$
$\sup_{z \in A} f(z)$, $\sup\{f(z) \mid z \in A\}$   supremum of the function $f$ over $A$
$\arg\min_{z \in A} f(z)$   $= z^* \in A$ with $f(z^*) = \min_{z \in A} f(z)$
$f(z) \Rightarrow \min!$   minimize $f$!
$1_A(x)$   $= 1$ if $x \in A$, $= 0$ otherwise (indicator function of the set $A$)
n 2 
TUN eet (n= Sw?) with 
‘ t=1 
Wi (W, 2.5, 0) © IR” anc 
Ys (1, Op Yn) eqk* 
$|A|$   number of elements of $A$
$[q]$   largest integer which is not greater than $q \in \mathbb{R}^1$


A4.4 Random variables and models 


$P^{y}$, $\mathcal{L}\{y\}$   distribution of $y$

$\mathcal{L}\{y \mid \vartheta\}$   distribution of $y$ for a given parameter $\vartheta$
prly=y, P*v conditional distribution of # given y = y 
$y \sim P$   $y$ is distributed according to $P$

yOQP ywr~P for some PEP 

$Ey = \int y\, P^{y}(dy)$ (expectation of $y$)
$E_P\, y = \int y\, P(dy)$ (expectation of $y$ if $y \sim P$)
E o =H PJ 
Boy = f y —= exp {—y?/2} dy 
27 

TR? 
$Dy = E(y - Ey)(y - Ey)'$ (covariance matrix of the random vector $y$)
Dsy = Es(y — Egy) (y — Esy)’ (covariance matrix of the random vector 
$\operatorname{Cov}(x, y) = E(x - Ex)(y - Ey)'$ (covariance matrix between $x$ and $y$)


$E(T \mid \mathfrak{B})$   conditional expectation of $T$ with respect to $\mathfrak{B}$
$E(T \mid y)$   conditional expectation of $T$ under $y$
°(L, kn = n> SY) wlU(a4) kat) 
t=1 
for 1,4: % > R}, w := {fw |¢ = 1,<..,2; 2 € N} 
= (is i) pag 


for | = (h,...,,)": 2 +R? and k = (h,...,h,)' : L > R! 
“(1,k) —lim (I, b)p 


“tle = “, Ln 
Sey a= LY) 


"ly —U, =n Swirly, — Ua)? 


t=1 
for y = (¥1,--- Yn)’ € IR® andl: Y — R! 
$\varphi_y$   characteristic function of the random vector $y$
$L_y(\vartheta)$   likelihood or log-likelihood function for the observation $y$
St, 
Sty model as set of structures 
(= {Sty | x € I7}) 


structure with parameter x 


A4.5 Distributions and measures 


$N_p(\mu, \Sigma)$, $N(\mu, \Sigma)$   $p$-dimensional normal distribution with expectation $\mu$ and
covariance matrix $\Sigma$

Nap oe, x &) A) ag Np(M, 2 @ A) 

$\Phi$   distribution function of $N(0, 1)$

$W_l(p, \Sigma)$   $l$-dimensional Wishart distribution with $p$ degrees of freedom and
expectational matrix $\Sigma$

$\chi_p^2$   central $\chi^2$ distribution with $p$ degrees of freedom

$\chi_{\alpha;p}^2$   $(1 - \alpha)$ quantile (upper $\alpha$-point) of the distribution $\chi_p^2$




$F_{p_1,p_2}$   central $F$-distribution with $p_1$ and $p_2$ degrees of freedom


$F_{\alpha;p_1,p_2}$   $(1 - \alpha)$ quantile (upper $\alpha$-point) of the distribution $F_{p_1,p_2}$


n 
v1 X%, X ¥; product measure of the measures 7, 7. and 7, ..., ¥,, respectively 


w=1 
n 


yr = X », with y, =» (1 = Lema} 
i=1 
$\lambda$   Lebesgue measure on $(\mathbb{R}^1, \mathfrak{B}^1)$
$\nu \ll \mu$   $\nu$ is absolutely continuous with respect to $\mu$
$\lambda_{[0,1]}$   restriction of the measure $\lambda$ to $\mathfrak{B}_{[0,1]}$


$R[0, 1]$   rectangular distribution on $[0, 1]$


A4.6 Convergence 


$a_n \xrightarrow[n \to \infty]{} a$, $a_n \to a$   the sequence $\{a_n\}_{n \in \mathbb{N}}$ converges to $a$

$\xrightarrow{P}$   convergence in probability
$\xrightarrow{\text{a.s.}}$   convergence almost surely

$P_i \to P$   the sequence of probability measures $\{P_i\}_{i \in \mathbb{N}}$ converges in distribution
to the probability measure $P$

$o(\cdot)$   $a_n = o(b_n)$ if $a_n / b_n \to 0$

$O(\cdot)$   $a_n = O(b_n)$ if $a_n / b_n$ is bounded for all $n$

$o_P(\cdot)$   $x_n = o_P(y_n)$ if $x_n / y_n \xrightarrow{P} 0$

$O_P(\cdot)$   $x_n = O_P(y_n)$ if for any $\eta > 0$ there exist some $q < \infty$ and some
$n_0 \in \mathbb{N}$ with $P\{|x_n / y_n| \le q\} \ge 1 - \eta$ for all $n \ge n_0$

$\limsup$, $\liminf$   limit superior, limit inferior, respectively


A 4.7 Sample functions 


(a) Simple classified data (x; €« R®, 7 = 1,..., n) 


Xn) = (%i)ini,...,n 
n 

x, = 25; 

i=1 
i = a./n (sample mean of 2;,)) 
x = (x, —&.,..., %, — @.) (sample residuals) 
LD = S; (sample covariance matrix) 
D 2e8. 


PP 
RY-2 
ms

vi, => Liz 
j=1 
n 

xj = ay 
i=1 


8 

| 
Ms 
Mes 
& 


i=1 1=1 

%; = a; /m; 

a == / 10 

Xi. = (In, © V4, +++) Im, © &n) 

Xi. SNC Ry aaa ed 

X(n) = [%,, +--+ Xn 

ise = Sg; with & = x — Z%_ (sample covariance) 

Wx Se nhOn hp ae p 

W, = 8, for %; = x; — Z;, (sum of squares in the classes) 
B; = S; for %; = %;, — %, (sum of squares between the classes) 
ahs See) ia, CER 

We = W,, i vi, ¢ IR 


b, = Bz, if x; ¢€ R! 